NOTE:

Accompanying  the  original  report  on  which  this  issue 
of  the  Research  Bulletin  is  based  was  a  7-inch  micro- 
groove  disc  which  contained  recorded  examples  of  the 
sounds  of  reading  machine  output.   Unfortunately  it 
is  not  possible  to  include  such  a  disc  with  each  copy 
of  the  Bulletin.   IRIS  will  undertake,  however,  on 
special  request  from  researchers,  to  copy  the  disc  on 
to  a  small  spool  of  magnetic  tape.   Requests  for  the 
tape  copy  should  be  sent  to  the  Editor. 


PREFATORY  NOTE 

The  Research    Bulletin   of  the  American  Foundation  for  the  Blind 
is  intended  to  be  a  means  of  publication  for  some  scientific 
papers  which,  for  a  variety  of  reasons,  may  not  reach  the  mem- 
bers of  the  research  community  to  whom  they  may  prove  most  use- 
ful or  helpful.   Among  these  papers  one  may  include  theses  and 
dissertations of students, reports from research projects which
the  Foundation  has  initiated  or  contracted  for,  and  reports  from 
other  sources  which,  we  feel,  merit  wider  dissemination.   Only 
a few of these find their way even into journals which do not
circulate  widely;  others  may  never  be  published  because  of  their 
length  or  because  of  lack  of  interest  in  their  subject  matter. 

The  Research    Bulletin    thus  contains  both  papers  written 
especially  for  us  and  papers  previously  published  elsewhere.   The 
principal  focus  may  be  psychological,  sociological,  technological, 
or  demographic.   The  primary  criterion  for  selection  is  that  the 
subject  matter  should  be  of  interest  to  researchers  seeking  in- 
formation relevant  to  some  aspect  or  problem  of  visual  impair- 
ment; papers  must  also  meet  generally  accepted  standards  of 
research  competence. 

Since  these  are  the  only  standards  for  selection,  the  papers 
published  here  do  not  necessarily  reflect  the  opinion  of  the 
Trustees  and  staff  of  the  American  Foundation  for  the  Blind. 

The  editorial  responsibility  for  the  contents  of  the  Bulletin 
rests  with  the  International  Research  Information  Service  (IRIS) 
of  the  American  Foundation  for  the  Blind,  an  information  dissem- 
ination program  resulting  from  the  cooperative  sponsorship  of  the 
Foundation and certain scientific and service organizations in
other  countries.   In  the  United  States  financial  assistance  is 
provided  by  the  Vocational  Rehabilitation  Administration  of  the 
United  States  Department  of  Health,  Education,  and  Welfare,  and 
by  certain  private  foundations. 

Since  our  aim  is  to  maximize  the  usefulness  of  this  publi- 
cation to  the  research  community,  we  solicit  materials  from  every 
scientific  field,  and  we  will  welcome  reactions  to  published 
articles. 

M.  Robert  Barnett 
Executive  Director 
American  Foundation 
for  the  Blind 
AN INVESTIGATION OF AUDIO
OUTPUTS  FOR  A  READING  MACHINE* 

P.W.  Nye 

Booth  Computation  Laboratory 
California  Institute  of  Technology 
Pasadena,  California 


INTRODUCTION 

The  need  for  research  into  reading  devices  for  the  blind  arises 
because  the  cost  of  producing  braille  books  is  high  and  conse- 
quently the  number  of  titles  available  is  limited.   Furthermore, 
braille  books  are  bulky  in  comparison  with  printed  books  and 
take  many  highly  skilled  man-hours  to  encode  and  produce.   The 
idea  of  avoiding  this  dilemma  by  producing  a  reading  machine 
which  would  convert  ordinary  print  into  an  auditory  or  kinesthe- 
tic-tactile  display  has  been  under  study  at  various  times  during 
the  past  50  years.   It  is  often  assumed  that  such  a  device,  if 
made  cheaply  enough  for  personal  ownership,  would  free  many  blind 
people  from  embarrassment  by  permitting  them  to  read  personal 
typewritten  correspondence,  and  would  also  enable  them  to  read 
books  and  daily  newspapers.   In  particular,  the  needs  of  the  blind 
student  wishing  to  read  specialized  books  not  available  in  braille 
are  also  raised  as  justifying  this  work. 

It  must  be  admitted,  however,  that  at  the  present  time  there 
is  very  little  factual  information  about  the  reading  needs  and 
abilities  of  blind  people  which  could  justify  these  assumptions. 
The  view  is  held  by  many  people  that  the  desire  among  the  blind 
to  read  newsprint  and  correspondence  is  less  than  is  popularly 
believed,  and  that  making  access  to  book  print  easier  should  be 
the  prime  objective.   Furthermore,  if  this  is  true,  the  proper 
approach  would  be  to  improve  the  availability  of  braille  by  work- 
ing to  reduce  bulk  and  production  costs  and  increase  the  speed  of 
translation.   This  argument  is  partly  supported  by  current  opin- 
ion of  the  potentialities  of  reading  machines.   It  is  clear  that 
the  types  of  reading  machine  that  could  be  produced  sufficiently 
cheaply  for  personal  ownership  are  likely  to  permit,  even  with 
extended  training,  reading  speeds  significantly  below  the  levels 
achieved  with  braille,  particularly  by  those  trained  to  read 
braille  from  early  childhood.   This  essentially  is  the  case 


*The work described above was carried out in the Autonomics Division
of the National Physical Laboratory on behalf of St. Dunstan's.
This paper is published by permission of the Acting Director of
the Laboratory and with the approval of St. Dunstan's.


against  work  on  reading  machines.   In  its  favor  is  the  argument 
that  to  give  blind  readers  access  to  the  same  reading  matter  as 
sighted  persons  is  easily  the  most  elegant  solution  both  adminis- 
tratively and  socially,  despite  its  other  limitations.   Moreover, 
in  the  absence  of  clear  evidence  on  the  matter,  it  is  quite  reason- 
able to  make  assumptions  about  the  reading  patterns  of  the  blind 
on  the  basis  of  the  experiences  of  sighted  readers.   These  would 
suggest  that  even  a  limited  performance  device  could  meet  the 
needs  of  a  sizeable  proportion  of  the  blind  population;  probably 
about  4  percent  (16).   It  is  hoped  that  in  the  near  future  the 
Ministry  of  Health  will  accept  a  proposal  which  has  been  prepared 
by  representatives  of  St.  Dunstan's  and  the  Medical  Research 
Council  to  sponsor  a  survey  of  the  current  reading  habits  of 
blind  persons  in  England  and  Wales.   Meanwhile,  in  the  absence 
of  this  human  data  on  which  to  base  research  objectives,  economic 
considerations  usually  predominate. 

There  are  broadly  two  kinds  of  potential  reading  machine. 
The  first  is  the  direct  coding  type  which  converts  printed  text, 
scanned  by  a  photoelectric  sensing  device,  into  sound  patterns 
bearing  some  relationship  to  the  letter  shapes.   The  reader  has 
to  learn  to  recognize  the  sounds  and  to  associate  them  with  the 
printed  letters.   The  second  possible  machine  differs  from  the 
first  in  that  the  machine  itself  is  responsible  for  recognition 
at  the  letter  level  and  could  be  made  to  produce  spelled  speech 
sounds  or  connected  speech  at  the  output.   Both  the  recognition 
machine  and  the  direct  coding  reader  would  be  required  to  function 
satisfactorily  with  a  wide  range  of  print  style  and  quality  vari- 
ants.  The  technical  problems  of  designing  the  recognition  machine 
for  this  standard  of  performance  will  take  many  years  to  solve  and 
in  the  initial  stages  the  machine  will  certainly  be  costly  and 
cumbersome.   But  the  machine  would  have  the  advantage  that  its 
output could be made to closely resemble ordinary speech and
avoid many of the difficulties of learning and the slow reading
speeds  that  beset  the  simpler  reading  devices. 

It  is  quite  clear  that  a  machine  producing  ordinary  speech 
will  be  the  best  solution  to  this  problem,  but  at  the  moment  the 
economic  and  technical  difficulties  are  such  as  to  make  it  unlike- 
ly that  a  machine  can  be  produced  for  personal  ownership  for  many 
years.   It  is  possible  that,  as  an  interim  measure,  some  modifi- 
cations could  be  made  to  the  direct  coding  type  of  machine  which 
would  enable  it  to  partially  meet  the  demand  until  a  more  satis- 
factory machine  is  made  available. 

A  considerable  amount  of  work  has  been  carried  out  in  the  USA 
and  USSR  on  variants  of  the  Fournier  d'Albe  optophone,  which  is 
the classic direct coding machine (see Appendix A). The results
of  all  these  recent  investigations  have  strongly  confirmed  that 
the  system  has  inherent  limitations  which  restrict  the  reading 
speed to between 10 and 20 words per minute (wpm) for even the
best pupils. Cooper (5) and Beurle (4) have suggested reasons
for  the  poor  performance  of  the  optophone  and  the  observations 
discussed  below  incorporate  a  number  of  their  conclusions.   The 
present  study  has  been  concerned  principally  with  the  audible  dis- 
play and  has  concentrated  on  examining  the  effects  of  modifying 
three  characteristics  of  the  optophone  output.   These  have  been 
suggested,  by  earlier  studies,  to  be  the  underlying  cause  of  its 
poor  performance. 

1)  The  optophone  transforms  two-dimensional  characters 
into  the  acoustic  dimensions  of  frequency  and  time. 
Experiments  have  been  carried  out  to  examine  the  effects 
of  increasing  the  number  of  dimensions. 

2)  Printed  characters  contain  a  considerable  amount  of 
visual  redundancy  which  appears  in  the  optophone  out- 
put in  a  form  which  cannot  be  utilized  by  the  ear. 
The effects of removing some of this redundancy from
the display have been explored.

3)  A  widely  held  theory  of  speech  perception  holds  that 
better  discriminations  can  be  made  among  certain 
speech  sounds  than  nonspeech  sounds.   This  suggests 
that  to  achieve  the  best  results  from  the  human  ear 
the output of a reading machine should be mimicable.

An  apparatus  incorporating  all  these  modifications  has  been 
simulated  and  the  results  of  subjective  tests  compared  with  those 
of  earlier  workers.   This  device,  which  is  first  and  foremost  a 
research  tool,  lies  somewhere  midway  between  the  direct  coding 
and the recognition machines. To create a practical machine
considerable reorganization of the design and some further work on
the  basic  concepts  will  be  necessary,  but  the  results  of  this  re- 
search show  that  progress  in  the  direction  of  a  compromise  between 
direct  coding  and  recognition  can  be  made.   The  outstanding  ques- 
tion will  be  to  what  extent  design  simplicity  must  be  sacrificed 
to  achieve  adequate  reading  speeds. 

The  three  factors  considered  in  this  study  are  all  features 
which  distinguish  the  outputs  of  direct  coding  devices  from  speech. 
There  is  a  fourth  factor  which  can  certainly  affect  performance. 
In  speech,  the  structure  of  a  sentence  can  be  conveyed  by  varia- 
tions in  the  pitch  and  stress  of  the  voice,  whereas  the  reading 
machine  effectively  talks  in  a  monotone.   In  the  case  of  sentences 
having a simple structure this is unlikely to be a serious
disadvantage, but in the case of more complex constructions greater
difficulties are likely to arise. Sentences of the type, "The family,
the  woman  we  met  yesterday  told  us  about,  is  leaving  tomorrow.", 
have,  in  Yngve's  (21)  terms,  a  'regressive  structure'  in  which  the 
subject  of  each  clause  has  to  be  memorized  until  its  predicate  is 
reached. The stress and pitch of the speaker's voice customarily
give the listener information about the location of the clauses
which  would  be  almost  entirely  absent  in  the  output  of  a  reading 
machine.   The  effect  of  this  loss  of  information  can  only  be  a 
reduction  in  reading  speed.   Insufficient  time  has  been  available 
in  this  study  to  find  out  the  likely  loss  of  efficiency  with  dif- 
ferent kinds  of  reading  matter  but  it  seems  probable  that  book 
print  would  suffer  most  severely. 

BACKGROUND  REVIEW 

The  problem  of  specifying  a  good  design  for  a  reading  machine 
can  be  neatly  stated  as  that  of  finding  an  optimum  transformation 
of  printed  information  from  the  visual  to  the  auditory  modality. 
Expressed  in  this  way,  the  problem  appears  precise  and  deceptive- 
ly  straightforward,  but  it  raises  a  large  number  of  unanswered 
questions.   A  man  reading  aloud  is  carrying  out  such  a  transfor- 
mation, but  natural  speech  may  not  be  the  only  auditory  display 
that  could  facilitate  this  optimum  performance.   There  may  be  many 
other  ways  in  which  information  from  the  printed  page  could  be  op- 
timally transformed.   To  be  certain  that  other  methods  exist,  we 
would  require  a  general  working  model  of  the  human  processes  in- 
volved in  the  evolution  of  alphabets,  in  reading  and  in  speech. 
With  this  knowledge,  we  could  hope  to  deduce  whether  or  not  the 
class  of  displays  called  speech  are  alone  the  most  efficient  means 
of  achieving  the  transformation  and,  if  so,  precisely  how  much 
poorer  a  performance  one  should  expect  from  nonspeech  displays. 
However,  little  or  none  of  this  relevant  data  exists  and,  in  the 
absence  of  such  information,  only  limited  attempts  toward  devel- 
oping some  kind  of  model  have  been  possible.   Theories  in  this 
field  and  some  of  the  'facts'  are  in  many  respects  speculative 
and  incomplete,  and  because  a  rapid  solution  is  required  as  an 
interim  measure  the  approach  to  the  work  has  been  only  in  part 
pragmatic  and  in  the  remainder  intuitive.   However,  where  some 
concept  has  influenced  certain  activities  and  objectives  in  this 
study,  an  attempt  has  been  made  in  the  following  sections  to 
describe  the  relevant  parts  of  this  model. 

One  reason  why  reading  with  an  optophone  is  so  slow  lies  in 
the  response  time  of  the  human  ear.   When  the  repetition  rate  is 
low,  clicks  and  other  short  stimuli  are  heard  as  separate  events; 
as  the  rate  increases,  the  sound  changes  to  a  buzz  at  about  20  cps 
and  then  to  a  tone  of  rising  pitch.   Even  if  the  stimuli  are  not 
the  same,  their  individuality  is  lost  at  20  cps.   This  indicates 
that there is a maximum repetition rate for the reception of
discrete stimuli, which is limited to about 15 per second. This
figure can explain the limiting rate for the reception of Morse code.
On average, Morse symbols contain approximately three elements
per letter; taking the average word length in English to be
4-1/2 characters, this gives an estimated maximum of 65 words per
minute, which agrees very closely with the best recorded reading
speeds with the Morse code. A similar calculation for the
optophone based upon the need to discriminate five chords per letter
gives  a  somewhat  lower  figure,  again  agreeing  with  the  best 
reading speeds recorded by Beurle (4). Applying the same
principles to spoken English, with syllables being regarded as the
individual stimuli, gives, as expected, a much higher speed.
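
The rule of thumb used in the preceding paragraph can be written out as a short calculation. The sketch below is an illustration only: the ceiling of 15 discrete stimuli per second, three Morse elements per letter, five optophone chords per letter, and 4-1/2 characters per word are the figures quoted above, while the function name and the neglect of gaps between letters and words are assumptions made here.

```python
def max_words_per_minute(stimuli_per_second, stimuli_per_letter, letters_per_word=4.5):
    """Upper bound on reading speed when every code element must be heard separately."""
    letters_per_second = stimuli_per_second / stimuli_per_letter
    return 60.0 * letters_per_second / letters_per_word


EAR_LIMIT = 15.0  # discrete stimuli per second, the figure quoted in the text

morse_wpm = max_words_per_minute(EAR_LIMIT, 3)      # about three elements per letter
optophone_wpm = max_words_per_minute(EAR_LIMIT, 5)  # five chords per letter

print(f"Morse estimate: {morse_wpm:.0f} wpm")
print(f"Optophone estimate: {optophone_wpm:.0f} wpm")
```

With these inputs the Morse bound comes out close to the 65 wpm quoted above, and the optophone bound is lower by the ratio 3/5, since only the number of stimuli per letter differs between the two cases.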

At  a  higher  level  in  the  information  processing  system  of 
our  brains  there  is  another  restriction  presented  by  the  limited 
span  of  immediate  memory.   Miller  (14)  has  described  evidence  on 
which  he  has  based  the  hypothesis  that  our  memory  span  for  infor- 
mation is  limited  to  about  seven  independent  chunks  of  informa- 
tion, irrespective  of  the  information  content  of  each  chunk.   He 
has  also  suggested  that  the  structure  of  language  is  evidence  of 
the  way  we  require  to  chunk  or  quantize  information  in  terms  of 
words,  phrases,  sentences,  etc.,  and  that  this  hierarchical  struc- 
ture indicates  the  way  in  which  information  has  to  be  processed 
by  our  brains.   Thus  if  the  reading  rate  is  slowed  down  by  the 
response  time  of  the  ear  or  other  factors  the  short  term  memory 
system  is  employed  inefficiently  and  the  beginning  of  a  sentence 
or  paragraph  can  be  easily  forgotten  before  the  end  is  reached. 
The  reading  process  therefore  becomes  extremely  tedious  and  fa- 
tiguing. 

Clearly,  if  a  given  passage  of  text  is  spoken,  encoded  into 
Morse  or  scanned  by  an  optophone,  the  number  of  elements  in  the 
spoken  version  will  be  much  fewer  than  in  either  Morse  or  the 
optophone code, although the amount of useful information
conveyed by all three media will be the same. The reason for this
difference  stems  from  the  fact  that  there  is  a  high  degree  of 
redundancy  in  printed  characters.   This  is  illustrated  by  the 
familiar  experiment  of  masking  the  lower  half  of  a  line  of  print 
whereupon  it  can  usually  still  be  read  with  very  little  diffi- 
culty.  Much  of  the  redundancy  can  be  utilized  by  the  eye  and 
helps  to  minimize  error,  but,  via  direct  encoding,  it  is  not  ac- 
ceptable to  the  ear,  which  has  a  lower  information  capacity. 
When  reading  aloud  the  speaker  carries  out  a  complex  process  in 
which  he  removes  the  original  redundancy  of  the  print  pattern 
and  reproduces  the  essential  information  with  redundancy  of  a 
different  kind  that  the  ear  can  utilize.   The  presence  of  redun- 
dancy in  speech  is  indicated  by  its  ability  to  resist  many  kinds 
of  distortion  and  still  remain  intelligible. 

One  source  of  redundancy  in  print  derives  from  the  fact 
that  letters  are  composed  of  continuous  lines  and,  if  given  a 
partially  masked  character,  the  invisible  portion  may  be  guessed 
with  small  risk  of  error  by  inspecting  the  trajectories  of  lines 
at  the  boundary  of  the  mask.   In  the  case  of  the  optophone,  ob- 
servation shows  that  the  ear  can  make  only  very  limited  use  of 
the  sequential   redundancy  contained  in  successive  chords  and, 
if information is lost, it is much more difficult to guess
correctly. The eye, having access to all parts of the two-dimensional
pattern simultaneously, is capable of utilizing the redundancy
in print; but the ear, receiving this same information serially,
is not. This is not to suggest that the ear is unable to utilize
sequential redundancy at all, for at higher levels, involving
larger  units  of  spoken  text,  sequential  redundancy  does  play  a 
part,  but  the  rate  at  which  the  decisions  have  to  be  made  is,  by 
comparison,  extremely  low.   The  kind  of  redundant  signal  that 
the  ear  is  accustomed  to  use  is  formed  by  duplicating  the  same 
information  among  several  dimensions  in  the  acoustic  pattern. 
Thus,  in  the  absence  of  information  defining  some  signal  property 
along  any  one  dimension,  there  is  a  high  probability  that  the  same 
information  can  be  obtained  by  the  ear  from  other  dimensions  in 
the  display.   It  will  be  clear  from  this  discussion  that  the  no- 
tion of  direct  coding  must  undergo  some  radical  modifications 
aimed  first  at  achieving  a  transformation  of  print  which  approach- 
es more  closely  the  principles  involved  in  reading  and,  second, 
to  ensure  that  the  number  of  signal  elements  per  word  is  reduced 
to  syllable  proportions. 

Pollack  (17,  18)  has  carried  out  experiments  using  sounds 
with  different  numbers  of  physical  dimensions  (e.g.,  frequency, 
loudness,  rate  of  interruption,  etc.).   He  has  shown  that  multi- 
dimensional signals  can  convey  more  information  per  signal  ele- 
ment than  signals  employing  only  one  dimension.   Thus  increased 
dimensionality  can  be  used  not  only  to  afford  some  duplication  of 
information  but  also  to  carry  more  information  per  signal  element. 
However,  Pollack's  subjects  recorded  their  discriminations  in 
their  own  time  by  a  check  list  procedure.   This  meant  that  for 
the  multidimensional  stimuli  containing  more  information,  it  was 
possible  that  a  process  of  memorization  followed  by  serial  identi- 
fication of  the  dimensions  made  the  recognition  time  proportional- 
ly longer.   The  importance  of  this  point  to  the  problem  of  speci- 
fying the  output  of  a  reading  machine  stimulated  some  work  to 
determine  whether  multidimensional  sounds  did  carry  a  penalty  in 
recognition  time. 

In  an  outline  of  the  psychological  considerations  that  con- 
trol the  design  of  a  reading  machine  output  Studdert-Kennedy  and 
Liberman  (19)  regard  speech  as  permitting  the  highest  attainable 
speeds  of  auditory  communication  and  proceed  to  analyze  the  fac- 
tors which  give  it  this  high  efficiency.   They  state  that  there 
are  good  reasons  for  believing  that  there  is  more  to  the  percep- 
tion of  speech  than  the  fact  that  it  is  multidimensional  and  that 
discrete  units  of  information  (the  syllables)  fall  within  the  re- 
solution time  of  the  ear.   The  authors  describe  certain  'discrep- 
ancies' that  occur  in  the  perception  of  speech  stimuli  which  they 
suggest  may  be  explained  by  a  theory  of  speech  perception  re- 
ported in  their  publications.   This  theory  states  that  the  per- 
ception of  speech  is  tightly  linked  to  the  feedback  from  the  lis- 
tener's own  speech  mechanism  and  that  we  discriminate  at  least 
some  speech  sounds  by  monitoring  the  nervous  activity  necessary 
to imitate the incoming speech patterns. In a series of papers
Liberman, Cooper, Fry, Eimas, and others (6, 7, 12) have reported
the  effects  of  a  gradual  movement  of  the  second  formant  transi- 
tion in  a  series  of  synthesized  phonemes  (Figure  1).   These  sounds, 
which cover the consonants /b/ /d/ and /g/, are not heard as a
series  of  gradually  changing  stimuli  but  as  a  series  of  identical 
/b/'s,  followed  by  an  abrupt  change  to  a  series  of  identical  /d/'s, 
which  again  shifts  abruptly  to  a  series  of  /g/'s.   Discrimination 
is  shown  to  be  sharp  across  the  phoneme  boundaries  but  to  be  very 
poor  within  phoneme  categories.   In  contrast,  corresponding  dis- 
criminations of  a  similar  number  of  stimuli  spanning  the  three 
vowels  /I/  /e/  and  /ae/  do  not  show  the  same  sharp  categorization 
and  differential  judgments  (in  which  adjacent,  or  near  adjacent, 
stimuli  are  presented  in  rapid  succession)  do  produce  precise  re- 
sults uniformly  throughout  the  range.   The  theory  suggests  that 
the  discontinuities  that  occur  in  the  perception  of  the  consonants 
reflect  the  different  articulations  that  are  necessary  to  produce 
them.   Thus  /b/  is  produced  by  a  movement  of  the  lips  and  /d/  by 
a  movement  of  the  tongue.   Unlike  the  vowels,  there  is  no  way  in 
which  a  continuum  of  consonant  sounds  can  be  produced  between  /b/ 
and /d/. The articulatory movements are discrete and so are the
perceptions.

If  there  is  some  feedback  mechanism  of  discrimination  which 
makes  possible  preferential  responses  to  certain  classes  of  speech 
sounds,  then  to  achieve  the  best  possible  performance  from  a  read- 
ing machine  it  is  clearly  desirable  to  use  some  form  of  analogue 
speech  mechanism  which  produces  these  sounds.   In  view  of  the  im- 
pact of  the  theory  on  our  research  it  has  been  thought  worthwhile 
to  examine  the  evidence  further. 

The  theory  is  intuitively  plausible,  but  it  has  the  unsatis- 
factory drawback  that  it  is  peculiarly  difficult  to  design  ex- 
periments that  would  really  put  it  to  the  test.   If  examples  of 
continuous  perceptions  in  speech  could  be  found,  in  situations 
where the sounds were produced by distinct articulations, only
then  could  the  theory  be  falsified.   This  appears  to  be  the  only 
direct  method  by  which  it  could  be  tested  satisfactorily.   If  the 
nervous  feedback  from  the  articulators  were  monitored  by  means  of 
suitably  positioned  electrodes  it  seems  possible  that  clear  evi- 
dence could  be  gathered  that  would  confirm  or  deny  the  theory, 
but  in  practice  the  techniques  are  difficult  and,  if  no  signals 
were  detected,  this  result  would  provide  insufficient  justifica- 
tion for  rejecting  the  theory.   Indirect  methods  of  examination 
involve  setting  up  an  alternative  hypothesis  and  showing  that  the 
observations  are  consistent  with  the  action  of  an  entirely  dif- 
ferent mechanism.   Thus  we  could  postulate  that  the  discrimination 
effects  are  learned  responses  which  could  be  achieved  with  non- 
speech  stimuli  if  sufficient  time  were  devoted  to  training.   How- 
ever, the  question  of  training  time  is  crucial  and  will  influence 
the  results  of  any  experiment  to  test  this  hypothesis,  for  it  is 
unlikely  that  laboratory  subjects  could  be  trained  for  long  enough 
to enable them to compete with speech discriminations at the same
level  of  efficiency.  Thus  the  results  are  unlikely  to  be  conclu- 
sive. 

There  is  one  other  approach  to  testing  the  feedback  theory 
which  does  not  suffer  directly  from  the  training  difficulty.   This 
approach involves an investigation of whether the critical
differences in discrimination can be found for sequences of
non-mimicable sounds which are distinguished from one another by features
analogous  to  those  found  in  the  synthesized  speech  stimuli.   This 
point  is  examined  in  more  detail  in  the  following  section. 

There  is  an  assumption  in  the  methodology  of  the  experiments 
with synthetic speech stimuli that both series of sounds are
distributed along single dimensions in the perceived signal space
and that there is no a priori reason why we should expect the
results of absolute discriminations on /b/ /d/ or /g/ to be any
different from those on /I/ /e/ or /ae/. In fact, from an inspection
of  the  physical  differences  between  the  stimuli  we  certainly 
should  be  led  to  expect  some  differences  in  the  responses.   First, 
the  vowel  stimuli  have  a  single  phase  in  which  components  which 
distinguish  them  from  other  vowels  are  sustained  for  300  msec 
while  the  consonants  contain  two  phases,  the  transient  of  the  sec- 
ond formant  and  its  subsequent  stationary  state  during  the  follow- 
ing vowel.   Although  the  total  length  of  the  consonant-like  stimu- 
li is  also  300  msec,  the  portion  of  any  one  stimulus  that  distin- 
guishes it  from  all  others  is  only  75  msec  in  length.   Taking  in- 
to consideration  the  fact  that  the  power  in  the  second  formant  is 
8  db  below  that  contained  in  the  first,  it  is  in  consequence  ex- 
tremely difficult  for  the  ear  to  detect  rates  of  change  of  formant 
two differing by several hundred cps. This may explain the poor
relative  judgments  that  are  made  with  stop  consonant  stimuli  un- 
der ABX  presentation  procedures.   Categorization  judgments  on  the 
vowel  sounds  are  therefore  carried  out  under  entirely  different 
conditions  to  those  for  consonants.   Genuine  conditions  for  abso- 
lute judgments  can  prevail  for  vowels  but  classification  of  the 
consonants  can  be  assisted  by  judging  the  trajectory  of  the  75- 
msec  transient  relative   to  the  position  of  the  following  225-msec 
vowel.   Thus,  the  three  states  of  the  second  formant  transition, 
the  descending,  continuous,  and  ascending  phases,  automatically 
define three categories for the observer. To summarize, an
examination of the physical differences between the stimuli can provide
an  explanation  of  the  intercategory  similarities  experienced  with 
the  stop  consonants  and  can  suggest  some  reasons  for  the  sharpness 
of  the  category  boundaries.   However,  these  considerations  are 
not  sufficient  in  themselves  to  explain  all  the  observations  and 
it  is  necessary  to  make  the  additional  proposal,  that  language 
training  and  experience  help  to  sharpen  perception  in  the  regions 
of  the  phoneme  boundaries.   If  the  combination  of  physical  factors 
and  the  effects  of  learning  are  indeed  the  determinants  of  the 
observations, then there is no reason why any nonspeech sounds
could not be utilized to form a communication medium as flexible
as  speech  itself.   Some  preliminary  experiments  have  been  con- 
ducted to  gather  evidence  bearing  upon  those  conclusions. 

SPEED  OF  DISCRIMINATION  FOR 
MULTIDIMENSIONAL  SIGNALS 

Pollack and Ficks were able to show that, as the number of
independent characteristics (dimensions) of a stimulus is increased,
the  amount  of  information  transmitted  per  stimulus  can  also  rise. 
But,  as  these  writers  themselves  have  pointed  out,  to  be  able  to 
judge  the  relevance  of  this  observation  in  understanding  auditory 
communication,  "it  is  necessary  to  see  whether  the  transmission 
of  information  per  unit  time  can  likewise  be  appreciably  increased. 
If  increased  discrimination  speed  could  be  shown  to  result  from 
the  use  of  a  multidimensional  display,  this  would  help  explain 
the superior performance of speech."

A direct test of whether multidimensionality affects the
recognition time for a stimulus could be carried out by comparing a
unidimensional  code  with  a  code  having,  say,  five  independent  vari- 
ables.  Ideally  the  experiment  should  involve  the  discrimination 
of  a  sufficient  number  of  stimuli  so  that  the  recognition  time 
may be of the same order as, or longer than, the reaction time of the
response.   The  simplest  variables  to  use  in  such  an  experiment 
would  be  frequency,  intensity,  noise,  duration,  and  modulation. 
It  has  been  shown  by  Garner  (8),  Pollack  (17),  and  others  that
the  number  of  absolute  judgments  that  a  subject  is  capable  of 
making  along  any  one  of  these  dimensions  is  limited  to  about  six. 
To  furnish  a  sufficient  number  of  distinguishable  variables  to 
satisfy  the  recognition  time  requirements,  at  least  two  dimensions 
would  have  to  be  used  together.   Here,  frequency  and  intensity 
are  the  two  most  convenient  dimensions.   However,  to  achieve  an 
adequate  number  of  discriminable  steps,  the  intensity  dimension
requires  a  dynamic  range  of  more  than  95  db,  a  requirement  which
would  exclude  the  use  of  a  tape  recorder.   But,  because  of  its  con- 
venience and  the  need  for  reproducibility,  the  tape  recorder  is 
an  essential  tool  and  therefore  in  the  experiment  actually  carried 
out  compromises  were  made  and  the  ideal  choice  of  dimensions  modi- 
fied. 

The  dimensions  that  were  finally  chosen  are  shown  in  Figure  2. 
Some  of  these  variables  had  been  used  earlier  by  Beddoes,  Belyea,
and  Gibson  (3)  to  form  the  output  of  a  proposed  reading  machine 
employing  letter  recognition.   These  authors  carried  out  recogni- 
tion tests  on  a  number  of  dimensions,  used  together  and  in  isola- 
tion, to  establish  relative  independence  and  selected  frequency, 
modulation,  and  noise.   To  these  we  initially  added  time  and  in- 
tensity, but  time  was  later  abandoned  because  the  recognition  ac- 
curacy along  this  dimension  was  found  to  depend  significantly  up- 
on the  presence  of  noise.   The  variable  finally  chosen  to  replace
time  was  direction  of  origin  (i.e.,  left  or  right  ear).   All
dimensions  were  tested  singly  and  in  pairs  to  ensure  that  the  di- 
mensions were  equally  discriminable. 

The  two  codes  comprised  one  of  five  dimensions  (5D)  and
another  of  three  (3D),  constructed  in  such  a  way  that  both  codes
had  approximately  the  same  information  content.   The  intensity
levels  used  in  the  5D  code  were  0  db  and  -12  db,  and  those  for  the
3D  code  were  0  db,  -10  db,  and  -20  db;  the  maximum  intensity  level,
0  db,  was  adjusted  arbitrarily  by  each  subject.   All  sig-
nals were  tape  recorded  giving  a  noise  level  at  about  -40  db  rela- 
tive to  the  maximum  signal  intensity.   The  noise  dimension  was 
achieved  by  the  addition  of  a  separate  noise  source,  raising  the 
background  noise  level  to  -12  db.   Double  sideband  modulation  giv- 
ing either  a  square  or  sinusoidal  envelope  formed  two  states  of  the 
modulation  dimension  while  the  third  was  an  unmodulated  pure  wave- 
form. 

Two  groups  of  4  subjects  aged  between  18  and  25  years  were 
recruited  and  matched  as  closely  as  possible  on  the  basis  of  aca- 
demic ability.   Each  group  began  a  course  of  training  on  letter 
recognition  using  one  of  the  two  codes.   The  training  method  in- 
volved the  division  of  the  alphabet  into  three  approximately  equal 
sections  which  were  learned  separately,  the  transition  between
one  section  and  the  next  being  made  when  the  slowest  member  of  the
two  groups  had  reached  the  60  percent  correct  level.   When  each
section  of  the  alphabet  had  been  studied  in  this  way  the  training 
on  the  complete  alphabet  was  begun.   The  average  performance  of 
the  two  groups  during  training  was  monitored  and  is  shown  in  Fig-
ure 3  for  the  nine  sessions  immediately  prior  to  the  three  final 
test  sessions  (Nos.  10,  11,  and  12).   Each  session  lasted  for  be- 
tween 15  and  20  minutes  and  was  divided  into  four  phases: 

1.  earphone  balance  and  adjustment 

2.  short  refresher  course 

3.  practice  under  test  conditions 

4.  recognition  test  of  40  symbols. 

The  subjects  attended  regularly  during  a  period  of  two  months 
and  during  that  time  accumulated  a  total  of  five  hours  experience 
of  the  codes.   The  signals  were  all  0.6  sec  in  length  and  were 
followed  by  a  3-sec  interval  during  which  the  subject  made  his 
verbal  response.   Measurements  were  made  of  the  time  elapsing  be- 
tween the  beginning  of  the  signal  and  the  beginning  of  the  re- 
sponse for  each  of  120  symbols  selected  randomly  from  the  full  al- 
phabet.  All  these  results  have  been  histogrammed  in  Figure  4. 
The  distributions  are  not  significantly  changed  if  only  the  cor- 
rect responses  are  considered.   The  complete  set  of  response  times
was  therefore  pooled  for  each  subject  and  group,  and  the  means  and
variances  calculated  are  given  in  Table  1.


TABLE  1

                   Subject              Group
Group   Subject    Variance   Mean      Variance   Mean     t

3D      K.F.       0.34       2.24
        A.C.       0.25       1.95
        M.W.       0.39       2.53      0.48       2.34     4.9
        P.P.       0.61       2.63

5D      R.C-H.     0.18       1.32
        J.K.       0.22       2.16
        R.H.       0.57       2.72      0.64       2.10
        K.H.       0.61       2.18

An  information  analysis  (13)  of  the  three  test  sessions  re- 
vealed an  average  transmission  of  3.30  bits/symbol  from  a  possi- 
ble 4.57  bits/symbol  for  the  5D  code  compared  with  2.60  bits/sym-
bol from  a  possible  4.62  bits/symbol  for  the  3D  code.   The  infor- 
mation contributed  by  individual  dimensions  is  shown  in  Table  2. 


TABLE  2

                        Input          Transmitted
Group   Dimensions      Information    Information

3D      Frequency       1.58  bits     0.74  bits
        Intensity       1.58  bits     0.43  bits
        Modulation      1.58  bits     0.81  bits

5D      Frequency       1.00  bits     0.47  bits
        Intensity       1.00  bits     0.55  bits
        Modulation      1.00  bits     0.55  bits
        Noise           0.93  bits     0.65  bits
        Direction       0.97  bits     0.61  bits
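
The kind of information analysis referred to above can be sketched as follows. The estimate used here, T = H(stimulus) + H(response) - H(stimulus, response), is an assumption about the method of reference (13); the function name and the two small confusion matrices are hypothetical and are not data from the experiment.

```python
import math


def transmitted_information(confusion):
    """Estimate transmitted information in bits from a confusion matrix.

    confusion[i][j] is the number of times stimulus i drew response j.
    """
    total = sum(sum(row) for row in confusion)

    def entropy(counts):
        # Shannon entropy of the distribution obtained by normalising counts.
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    stimulus_totals = [sum(row) for row in confusion]
    response_totals = [sum(col) for col in zip(*confusion)]
    joint_counts = [c for row in confusion for c in row]
    return entropy(stimulus_totals) + entropy(response_totals) - entropy(joint_counts)


# Hypothetical two-symbol examples: perfect identification and pure guessing.
perfect = [[10, 0], [0, 10]]
guessing = [[5, 5], [5, 5]]
print(transmitted_information(perfect), transmitted_information(guessing))
```

With two equally likely symbols, perfect identification transmits the full one bit and pure guessing transmits nothing; real confusion matrices fall between these limits.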


The  average  response  time  to  the  5D  code  was  shorter  by  0.24 
sec  than  the  corresponding  time  for  the  3D  code,  this  difference
being  significant  to  better  than  the  0.001  level  of  probability.
However,  an  insufficient  number  of  subjects  were  available  to 
permit  the  elimination  of  individual  variations  from  the  results 
and  hence  it  is  not  possible  to  show  directly  that  the  differ- 
ences between  the  performance  of  the  two  groups  did  not  arise 
from  this  source.   Nevertheless,  there  is  no  indication  that  the 
5D  code  gave  the  slower  performance  expected  from  a  check  list 
discrimination  procedure. 
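
As a rough check on the significance quoted above, the t value can be recomputed from the pooled group figures in Table 1. The sketch assumes 480 observations per group (4 subjects, each with 120 response times) and an unequal-variance two-sample statistic; neither assumption is stated in the text, and the function name is hypothetical.

```python
import math


def two_sample_t(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """t statistic for the difference of two means with unequal variances."""
    return (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)


# Assumption: 4 subjects x 120 timed responses in each group.
N_PER_GROUP = 4 * 120

# Group means and variances (sec) as printed in Table 1: 3D first, then 5D.
t_value = two_sample_t(2.34, 0.48, N_PER_GROUP, 2.10, 0.64, N_PER_GROUP)
print(round(t_value, 2))
```

Under these assumptions the statistic comes out close to the tabulated value of 4.9, which suggests that the test was applied to the pooled individual responses rather than to the subject means.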

It  should  be  noted  that  the  five-dimensional  symbols  were 
identified  more  accurately.   The  plateau  in  the  learning  curve,
Figure  3,  for  the  5D  code  occurs  at  70  percent  correct  while  the 
corresponding  plateau  for  the  3D  code  is  at  the  49  percent  cor- 
rect level. 

Hick's  (9)  results  indicate  that,  in  a  signal-response  sit- 
uation similar  to  our  own,  speed  and  accuracy  can  be  exchanged. 
Therefore,  if  the  amount  of  information  transmitted  by  the  two 
codes  were  made  equal,  response  times  to  the  5D  code  could  be 
decreased  by  an  amount  proportional  to  the  difference  in  infor- 
mation transmitted  by  the  two  codes,  i.e.,  proportional  to 
(3.30  -  2.60)  bits.   This  would  result  in  a  further  increase  of 
the  difference  in  response  time  between  the  two  codes.   The  re- 
sults of  this  experiment  therefore  strongly  indicate  that  any 
increase  in  the  number  of  perceivable  dimensions  in  a  display 
can  increase  the  rate  of  transmission  for  discrete  signals  and 
almost  certainly  increase  the  continuous  transmission  rate  as 
well. 
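
The adjustment described above can be written out if a linear relation between response time and transmitted information, RT = a + bT, is assumed. This linear form is the one usually associated with Hick's results; the constants a and b are not given in the text and are left symbolic.

```latex
\Delta RT \;=\; b\,\bigl(T_{5D} - T_{3D}\bigr) \;=\; b\,(3.30 - 2.60) \;=\; 0.70\,b
```

Here b is the (unknown) time per bit, so the expected further gain for the 5D code is 0.70 b on top of the measured 0.24 sec.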

THE  SIMULATION  OF  DIFFERENT 
OUTPUT  DISPLAYS 

Before  proceeding  further,  it  was  necessary  to  construct  a  flex- 
ible system  which  would  provide  a  range  of  different  outputs  for 
experimental  use.   The  flexibility  arose  from  the  fact  that  the 
range  of  outputs  could  be  controlled  by  two  kinds  of  input  sig- 
nals; the  first  derived  from  digital  information  gathered  by  a 
single  row  of  photocells  scanning  the  characters  and  the  second 
from  a  more  complex  system  which  endeavored  to  carry  out  feature 
detection  on  curves  and  lines  and  so  reduce  this  source  of  re- 
dundancy.  Many  different  interconnections  of  these  input  and  out- 
put units  were  possible  but  only  those  connections  which  provided 
outputs  relevant  to  the  questions  under  study  were  used.   A  con- 
siderable portion  of  the  system  was  built  from  laboratory  optical 
and  electronic  equipment  and  the  remainder  simulated  on  a  digital 
computer.   The  computer  program  was  written  with  the  aim  of  a- 
chieving  certain  specific  objectives  in  the  simplest  and  most 
convenient  way,  and  without  regard  to  exploring  any  particular 
design  philosophy.   None  of  the  equipment  was  therefore  construc- 
ted with  an  eye  on  economy.   In  essence,  the  strategy  adopted 
was  to  build  an  apparatus  which,  in  one  form,  produced  an  output
conforming  in  a  number  of  ad  hoc  ways  to  the  principles  demanded
by  certain  conceptual  models,  and  to  measure  the  extent  to  which
the  rejection  of  one  or  more  of  these  principles  (by  using  other 
input-output  relationships)  affected  the  overall  performance  of 
experimental  subjects. 

The  basic  input  to  the  system,  from  the  printed  page,  was 
provided  in  the  early  stages  by  an  optical  scanning  device  shown 
in  Figure  5  and  described  in  Appendix  B.   Figure  6  shows  the
output  from  this  apparatus  recorded  on  three  different  occasions. 
It  was  found  that,  to  achieve  consistent  results,  the  mechanical 
alignment  had  to  be  maintained  within  close  limits.   The  plan  to 
use  this  equipment  to  generate  new  data  when  required  was  aban- 
doned as  a  direct  result  of  this  experience,  but  the  clearest 
scan  patterns  of  individual  letters  of  the  alphabet  were  sal- 
vaged and  used  as  a  standard  source  from  which  all  subsequent 
digitized  text  material  was  assembled  by  hand.   These  patterns 
were  then  either  degraded  into  a  six  bits  per  scan  form  closely 
equivalent  to  the  optophone,  or  were  processed  directly  by  a 
simple  feature  analysis  program  in  a  digital  computer.   The  ob- 
ject of  this  analysis  was  to  effect  a  reduction  of  the  amount  of 
binary  data  in  the  source  patterns  while  essentially  retaining 
the  important  textual  information. 

On  average  each  character  was  represented  on  a  12  by  5  bit 
matrix;  for  the  letter  m,  the  widest  character,  the  size  of  the 
matrix  was  12  by  9  bits.   Connected  text  was  assembled  with  one 
scan  of  zero  digits  inserted  to  separate  each  character  from  its 
neighbor  and  five  blank  scans  were  used  to  separate  words. 
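
The assembly of connected text from the stored letter patterns can be sketched as follows. The glyph dictionary, the function names, and the placeholder patterns are hypothetical; the sketch also assumes that the five blank scans between words replace, rather than add to, the single separating scan.

```python
ROWS = 12        # bits per scan, as in the 12-row character matrices
CHAR_GAP = 1     # blank scans between neighbouring characters
WORD_GAP = 5     # blank scans between words


def blank_scans(count):
    """Return `count` scans consisting entirely of zero digits."""
    return [[0] * ROWS for _ in range(count)]


def assemble_text(text, glyphs):
    """Concatenate letter patterns into one list of scans.

    glyphs maps each letter to its list of scans (each scan a list of ROWS bits).
    """
    scans = []
    for word_index, word in enumerate(text.split()):
        if word_index > 0:
            scans.extend(blank_scans(WORD_GAP))
        for letter_index, letter in enumerate(word):
            if letter_index > 0:
                scans.extend(blank_scans(CHAR_GAP))
            scans.extend(glyphs[letter])
    return scans


# Placeholder patterns with the widths quoted in the text: 5 scans and 9 scans.
demo_glyphs = {"a": [[1] * ROWS] * 5, "m": [[1] * ROWS] * 9}
demo = assemble_text("am a", demo_glyphs)
print(len(demo))
```

For the placeholder input "am a" the result is 5 + 1 + 9 scans for the first word, 5 blank scans, and 5 scans for the second word.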

The  feature  analysis  program  was  a  highly  simplified  ver- 
sion of  the  technique  described  by  Uyehara  (20)  and  functioned 
in  three  stages.   The  first  stage  consisted  of  an  examination  of 
the  first  column  of  digits  and  identification,  by  comparison  with 
a  mask,  of  ascenders,  ribbon  height  verticals,  and  descenders 
(Figure  7).   Stage  two  involved  the  storing  of  this  column  and
the  comparison  of  the  next  scan  line  or  column  with  the  store 
contents  to  determine  whether  there  were  digits  present  in  iden- 
tical positions  in  both  columns  -  indicating  the  presence  of  a 
horizontal  line  -  or  whether  digits  were  displaced  above  or  below 
one  another  -  indicating  upward  or  downward  curving  lines.   These 
stages  were  repeated  with  successive  scans.   A  total  of  six  fea- 
tures were  therefore  selected  and  each  identification  recorded 
individually  by  a  token  placed  in  the  respective  location.   In 
Figure  8  these  locations  are  set  out  as  the  columns  of  matrices 
produced  by  each  of  the  lower  case  letters  (a  through  z)  from  the
standard  alphabet.   The  order  in  which  the  columns  are  set  out 
from  left  to  right  is  1.   Ascenders,  2.   Descenders,  3.   Ribbon 
Height  Verticals,  4.   Curvature  Downward,  5.   Curvature  Upward, 
6.   Horizontal  Continuity. 
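
The two comparison stages described above can be sketched in a few lines. This is an illustration only and not the original program: the division of the twelve rows into ascender, ribbon, and descender zones, the rule that a vertical must fill its zone completely, and all function names are assumptions.

```python
# Assumed zones of a 12-bit scan, row 0 at the top of the character.
ASCENDER_ROWS = range(0, 4)
RIBBON_ROWS = range(4, 8)
DESCENDER_ROWS = range(8, 12)

# Token order follows the column order given for Figure 8.
FEATURES = ("ascender", "descender", "ribbon_vertical",
            "curve_down", "curve_up", "horizontal")


def filled(column, rows):
    """True when every row of the given zone contains a digit."""
    return all(column[r] for r in rows)


def vertical_features(column):
    """Stage one: compare a single column with the vertical masks."""
    found = set()
    if filled(column, RIBBON_ROWS):
        if filled(column, ASCENDER_ROWS):
            found.add("ascender")
        elif filled(column, DESCENDER_ROWS):
            found.add("descender")
        else:
            found.add("ribbon_vertical")
    return found


def transition_features(previous, column):
    """Stage two: compare a column with the stored previous column."""
    found = set()
    for r, bit in enumerate(column):
        if not bit:
            continue
        if previous[r]:
            found.add("horizontal")      # digit in the identical position
        elif r + 1 < len(previous) and previous[r + 1]:
            found.add("curve_up")        # digit has moved up one row
        elif r > 0 and previous[r - 1]:
            found.add("curve_down")      # digit has moved down one row
    return found


def analyse(scans):
    """Return one six-element token row per scan."""
    tokens = []
    previous = [0] * 12
    for column in scans:
        found = vertical_features(column) | transition_features(previous, column)
        tokens.append([1 if name in found else 0 for name in FEATURES])
        previous = column
    return tokens
```

Because only two consecutive scans are compared, a sketch of this kind shares the limitation of the original technique: it sees local displacement of digits but not the longer sweep of a curve.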

At  the  present  stage  of  development,  if  characters  from  a
different  font  had  been  used  or  they  had  not  been  correctly
aligned,  the  detailed  structure  of  each  of  the  matrices  would
change.   However,  for  the  case  where  they  originate  from  differ- 
ent digital  copies  of  the  same  character,  broad  similarities  be- 
tween the  matrices  would  still  remain.   A  method  was  sought  that 
would  succeed  in  filtering  out  the  small  discrepancies  in  these 
matrices  and  allow  only  the  gross  differences  between  different 
letters  to  emerge.   The  technique  chosen  to  achieve  this  involved 
the  blurring  of  each  non-zero  binary  token  by  a  Gaussian  function 
(the  equivalent  of  passing  the  output  of  each  channel  through  a 
low  pass  filter),  thereby  producing  six  continuous  functions  which
were  punched  out  in  multiplex  form  on  paper  tape. 
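
The blurring step can be illustrated for a single feature channel. The sketch below replaces each non-zero token by a Gaussian centred on its scan position and sums the results; the function name is hypothetical, and the standard deviation is expressed in scans purely for convenience.

```python
import math


def gaussian_blur(tokens, sigma):
    """Turn a row of binary tokens into a continuous function.

    tokens: list of 0/1 values, one per scan, for one feature channel.
    sigma:  standard deviation of the Gaussian, in scans (assumed unit).
    """
    output = [0.0] * len(tokens)
    for centre, token in enumerate(tokens):
        if not token:
            continue
        for i in range(len(tokens)):
            output[i] += math.exp(-0.5 * ((i - centre) / sigma) ** 2)
    return output


# A single token in the middle of five scans.
blurred = gaussian_blur([0, 0, 1, 0, 0], sigma=1.0)
print([round(v, 3) for v in blurred])
```

A small shift in the position of a token moves the peak of the blurred function only slightly, which is the sense in which small discrepancies between copies of the same character are filtered out.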

There  are  a  number  of  ways  in  which  the  filtering  effect 
could  have  been  achieved.   The  final  choice  however  was  not  en- 
tirely arbitrary  and  can  be  justified  partly  in  terms  of  expedi- 
ency and  partly  in  an  endeavor  to  make  the  process  parallel  some 
of  the  activities  involved  in  reading  and  speaking.   Thus  the 
effect  of  the  Gaussian  functions  was  to  perform  quasi-integration
of  the  token  matrices  and,  by  presenting  information  in  the  form  of
continuous  waveforms,  to  provide  a  method  of  achieving  the  coales-
cence of  groups  of  letters  into  word  units.   An  analogous  situa- 
tion exists  in  sighted  reading  when  under  normal  conditions  the 
reader  is  able  to  recognize  a  word  by  its  overall  shape  rather 
than  by  the  detailed  structure  of  the  letters.   When  context  fails 
to  supplement  this  loss  of  letter  information  the  reader  then  re-
verts to  a  study  of  individual  letters.   The  corresponding  process 
in  the  simulation  is  controlled  by  varying  the  standard  deviation 
of  the  Gaussian  functions,  although  this  was  never  explored.   As 
the  functions  become  narrower,  it  is  to  be  expected  that  more 
letter  detail  can  be  made  available,  but  with  a  consequent  loss 
in  reading  speed  due  to  the  ear's  limiting  reception  rate  for  dis- 
crete stimuli. 

It  will  be  seen  that  the  feature  analysis  program  reduces  the
amount  of  data  available  at  the  input  by  removing  some  of  the  re- 
dundant information.   Roughly  how  much  data  is  removed  can  be 
shown  by  calculation  based  upon  a  few  simplifying  assumptions.   If 
the  original  input  patterns  are  presented  by  an  optophone-like 
harmonic  display,  which  has  been  thoroughly  learned,  then  it  would 
be  reasonable  to  assume  that  the  reader  would  know  intuitively  the 
likelihood  of  occurrence  of  each  of  the  twelve  elements  of  the 
scan.   Hence  the  reader  can  be  regarded  as  an  optimum  decoding  de- 
vice and  the  information  contained  in  rows  of  digits  along  a
length  of  text  can  be  calculated  from  the  Shannon-Wiener  equation

$$H = -\sum_i p_i \log_2 p_i .$$

We  must  also  assume  that  the  interdependence  between  rows  (i.e., 
shared  information)  is  not  perceived  by  the  listener  and  that  they
can  therefore  be  analyzed  independently.

For  the  selected  characters  which  form  the  standard  alphabet, 
the  analysis  on  a  row  by  row  basis,  using  letter  frequencies 
found  in  English,  gives  an  average  information  per  character  of 
37  bits.   The  corresponding  calculation  carried  out  on  the  token 
matrices  derived  from  the  standard  alphabet  requires  the  reason- 
able assumption  that  the  columns  representing  the  parameters  are 
perceptually  independent.   On  this  basis  the  average  information 
per  character  is  approximately  22  bits  and  the  reduction  factor 
is  therefore  about  1.7.   The  ideal  factor  for  complete  recognition 
by  the  processing  device  is

$$\frac{37}{\log_2 26} = 7.9$$

which  can  be  approached  more  and  more  closely  as  the  feature  anal- 
ysis is  refined  and  made  more  efficient.   The  greatest  inefficien- 
cy arises  from  the  existing  technique  for  detecting  curves  and 
horizontal  continuity  which  uses  only  two  consecutive  scans.   An 
increase  in  the  horizontal  range  over  which  this  detection  logic 
functions  would  be  one  way  of  improving  efficiency  but  the  com- 
plexity of  the  logic  would  increase  and  make  any  attempt  at  an 
economic  hardware  realization  more  difficult. 
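
The two reduction factors quoted above follow directly from the stated figures, as the short check below shows. The entropy function is the standard Shannon form; the variable names are hypothetical, and the ideal case assumes 26 equally likely letters, as the expression log2 26 implies.

```python
import math


def shannon_entropy(probabilities):
    """H = -sum(p * log2(p)) in bits, ignoring zero probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


input_bits = 37.0    # average information per character in the scan patterns
token_bits = 22.0    # average information per character in the token matrices

achieved_factor = input_bits / token_bits
letter_bits = shannon_entropy([1 / 26] * 26)   # equals log2(26)
ideal_factor = input_bits / letter_bits

print(round(achieved_factor, 1), round(ideal_factor, 1))
```

The gap between the two factors is the redundancy that a more refined feature analysis could still remove.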

The  paper  tapes  carrying  the  feature  waveforms  were  converted 
into  six  simultaneously  available  analogue  voltage  functions  by 
means  of  an  electronic  demultiplexing  device  and  digital  to  ana- 
logue converter.   A  conversion  rate  of  60  words  per  minute  was 
usually  chosen  which  meant  that  the  Gaussian  functions  had  an  ef- 
fective standard  deviation  of  0.18  sec.   The  set  of  control  func- 
tions could  be  connected  to  one  of  three  outputs:  an  optophone
display,  a  multidimensional  sound  generator,  or  a  speech  synthesi- 
zer.  Provision  was  made  for  some  by-pass  connections.   For  ex- 
ample, a  six  bit  per  scan  degraded  version  of  the  input  patterns, 
punched  on  paper  tape,  could  be  used  to  trigger  six  tuned  oscil- 
lators thereby  producing  a  conventional  optophone  output  similar 
to  that  produced  by  the  Battelle  reader.   Figure  9  illustrates 
some  of  the  available  connections  between  the  major  components. 

The  optophone  oscillators  were  tuned  to  frequencies  of  1000 
cps,  768  cps,  640  cps,  576  cps,  512  cps,  and  384  cps,  and  could
be  switched  on  abruptly  by  signals  derived  directly  from  the  in- 
put patterns  or  could  be  controlled  by  the  continuous  functions. 
These  six  independent  functions  could  be  made  to  continuously  vary 
individual  oscillator  outputs  over  a  range  of  40  db  by  means  of 
electronic  volume  controls. 
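
The action of an electronic volume control over a 40 db range can be sketched as below. The mapping is an assumption: the control function is taken to be normalised to the range 0 to 1 and to vary the attenuation linearly in decibels, with decibels referred to amplitude. The function name is hypothetical.

```python
RANGE_DB = 40.0   # control range quoted in the text


def oscillator_gain(control):
    """Amplitude gain for a control value between 0 and 1.

    A control of 1 gives full output (0 db); a control of 0 gives -40 db.
    """
    control = min(1.0, max(0.0, control))
    attenuation_db = -RANGE_DB * (1.0 - control)
    return 10.0 ** (attenuation_db / 20.0)


print(oscillator_gain(1.0), oscillator_gain(0.5), oscillator_gain(0.0))
```

On this reading an oscillator is never silenced completely by its control function; at the bottom of the range it is reduced to one hundredth of its full amplitude.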

The  multidimensional  sound  generator  (part  of  the  multidimen- 
sional optophone  or  MDO)  was  evolved  from  the  equipment  used  in 
the  experiment  described  above.   Only  four  of  the  original  dimen- 
sions were  used,  however:  noise,  square  wave  modulation,  frequency,
and  intensity.   In  addition  the  generator  included  a  circuit  pro-
ducing a  click.   The  frequency  of  the  output  was  switched  between
480  cps  and  2000  cps  by  means  of  a  trigger  circuit  with  the  thresh-
old set  at  the  60  percent  level  and  driven  by  the  control  functions.
The  click  generator  was  controlled  in  a  similar  manner.   The  noise,
depth  of  modulation,  and  overall  output  intensity  were  all  con-
trolled by  the  waveforms  via  electronic  volume  controls.   The  fea- 
ture waveforms  were  associated  with  the  components  of  the  multi- 
dimensional generator  in  the  following  manner. 


Ascenders  and  Descenders   -  Click
Ribbon  Height  Verticals    -  Noise  intensity
Curvature  Downward          -  Overall  intensity
Curvature  Upward            -  Depth  of  Modulation
Horizontal  Continuity       -  Frequency
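
The trigger action on the frequency dimension can be sketched as follows. The text gives the two frequencies and the 60 percent threshold but does not say which frequency corresponds to the triggered state; the sketch assumes the higher one, and also assumes a control function normalised to the range 0 to 1. The function name is hypothetical.

```python
LOW_CPS = 480
HIGH_CPS = 2000
THRESHOLD = 0.60   # the 60 percent level of the control function


def output_frequency(control):
    """Frequency selected by the trigger for a normalised control value.

    Assumption: the higher frequency is produced at or above the threshold.
    """
    return HIGH_CPS if control >= THRESHOLD else LOW_CPS


print([output_frequency(c) for c in (0.0, 0.59, 0.60, 1.0)])
```

The frequency dimension is therefore two-valued even though it is driven by a continuous function, in contrast to the noise, modulation, and intensity dimensions.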

The  speech  synthesizer  used  in  the  experiment  was  an  early 
version  of  the  Parametric  Artificial  Talking  (PAT)  device  designed 
by  Lawrence  (2,  11).   The  device  is  an  electrical  analogue  of  the 
human  vocal  mechanism  and  comprises  components  which  simulate  the 
action  of  the  larynx  and  the  cavities  of  the  vocal  tract  (Figure 
10).   The  larynx  waveform  generator  produces  a  pulsed  output 
whose  frequency  and  loudness  can  be  controlled  by  applied  para- 
meter voltages.   This  signal  is  fed  through  three  filters  con- 
nected in  series  whose  resonant  frequencies  are  also  controlled 
by  external  voltages.   The  filters  simulate  the  action  of  the 
vocal  cavities  and  impress  the  characteristic  formant  patterns 
of  speech  on  the  emerging  waveform.   A  sixth  parameter  controls 
the  output  of  a  noise  generator  which  is  mixed  with  the  filter 
output,  amplified,  and  fed  to  a  loudspeaker.   The  six  controls 
required  by  this  system  correspond  to  the  six  sources  of  informa- 
tion arising  from  the  computer  program.   With  the  exception  of 
the  experiment  described  on  pages  31  through  34  the  following 
features  were  employed  to  control  the  various  components  of  the 
synthesizer. 


1.  Ascenders                  -  Noise
2.  Ribbon  Height  Verticals    -  Formant  3
3.  Curvature  Downward         -  Formant  2
4.  Curvature  Upward           -  Formant  1
5.  Horizontal  Continuity      -  Larynx  Frequency
6.  (rectangular  waveform)     -  Larynx  Amplitude

A  trial  passage  of  text,  "The  quick  brown  fox  jumps  over  the 
lazy  dog,"  was  processed  by  the  computer  program  and  the  waveforms 
compared  with  those  required  to  generate  intelligible  speech. 
Allocation  of  the  control  waveforms  to  components  of  the  PAT 
machine  was  decided  on  the  basis  of  achieving  the  closest  pos- 
sible match  with  the  speech  parameters.   The  sixth  control  took 
the  form  of  a  rectangular  waveform  switching  on  when  parameter  5 
exceeded  zero  volts  and  off  when  the  parameter  fell  once  again 
to  zero. 
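
The derivation of the sixth control from parameter 5 amounts to a simple gate, sketched below. The function name and the sample values are hypothetical; the parameter is assumed to be available as a list of sampled voltages.

```python
def larynx_amplitude_gate(parameter5):
    """Rectangular waveform: 1 while parameter 5 is above zero, otherwise 0."""
    return [1 if value > 0 else 0 for value in parameter5]


# Hypothetical samples of parameter 5 rising from and returning to zero.
gate = larynx_amplitude_gate([0.0, 0.2, 0.9, 0.1, 0.0, 0.0])
print(gate)
```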

Details  of  some  aspects  of  circuit  design  are  given  in  Appen- 
dix C  for  the  more  important  apparatus  used  in  this  work. 

THE  METHODS  USED  FOR  EVALUATING 
DIFFERENT  AUDITORY  OUTPUTS 

Two  basic  methods  of  training  subjects  were  used.   The  first  aimed 
at  monitoring  the  rate  of  learning  in  the  very  early  stages  under 
forced  choice  conditions  and  the  second  method  measured  perfor- 
mance after  several  hours  of  self-paced  tuition. 

To  permit  a  comparison  of  our  results  with  those  of  two  ear- 
lier workers  (Cooper  [5]  and  Abma  [1]),  a  method  derived  from  that 
described  originally  by  Cooper  was  used.   Thus,  what  was  termed 
the  "Haskins  Method"  utilized  a  vocabulary  of  eight  words  which 
were  transformed  into  different  sound  patterns.   Tape  recordings 
which  displayed  each  set  of  these  sounds  were  made  in  two  sections. 
Section  1  was  composed  of  an  introduction  to  the  eight  word  sounds 
in  which  each  sound  was  played  twice.   Section  2  was  formed  by 
the  test  passages  in  which  eight  or  sixteen  series,  each  of  24 
sounds,  were  played  with  a  break  of  45  seconds  between  each  series. 
Within  each  series  each  sound  appeared  a  total  of  three  times  with 
the  order  of  appearance  randomized. 
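
One test series of the kind described can be generated as follows. The function name and the placeholder word labels are hypothetical; the text does not say how the randomization was actually carried out.

```python
import random


def make_series(sounds, repeats=3, rng=None):
    """Return the sounds in random order, each appearing `repeats` times."""
    rng = rng or random.Random()
    series = list(sounds) * repeats
    rng.shuffle(series)
    return series


# Eight placeholder labels standing in for the eight word-sounds.
words = ["word%d" % i for i in range(1, 9)]
series = make_series(words, rng=random.Random(0))
print(len(series))
```

Eight sounds repeated three times give the 24 presentations of each series.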

The  presentation  sequence  for  each  word-sound  followed  a  pro- 
cedure in  which  the  sound  was  played  once,  there  was  then  a  pause 
of  3  seconds  while  the  subject  recorded  his  identification,  where- 
upon the  sound  was  played  again  and  immediately  followed  by  the 
correct  answer.   A  further  4  seconds  elapsed  and  the  cycle  began 
again  with  a  new  word-sound.   Each  subject  learned  the  response
list  of  words  before  the  experiment  began.   Between  2  and  3  minutes 
were  set  aside  for  this  purpose  before  Section  1  of  the  tape  re- 
cording was  begun.   Having  heard  each  word-sound  played  twice  the 
subjects  proceeded  to  the  test  Section.   The  word  identified 
with  each  sound  was  written  down  by  the  subject  on  a  sheet  of  pa- 
per.  Each  subject's  progress  was  measured  in  terms  of  the  number 
of  sounds  correctly  identified  in  each  series  of  24.   From  this 
data  learning  curves  could  be  compared  with  one  another.

The  long  term  self-paced  training  program  was  called  the
"NPL  Method."   Recordings  consisted  of  a  short  introduction  termed
Section  1  in  which  each  letter  or  word-sound,  taken  in  turn,  was
played  five  times;  Section  2  in  which  the  8  sounds  were  played
through  three  times;  Section  3  in  which  pairs  of  sounds  which 
promised  to  be  difficult  to  distinguish  were  played  alternately 
to  assist  discrimination;  and  Section  4  in  which  sounds  were 
played  in  random  order  and  their  identity  announced  after  each 
presentation.   Following  the  training  sections  were  18  practice 
passages  each  of  20  sound  symbols  and  the  final  tests  of  240 
symbols.   Correct  answers  were  provided  for  all  the  practice  se- 
quences and  subjects  were  thus  able  to  monitor  their  own  progress. 

The  length  of  the  training  period  varied  between  three  and 
six  hours  but  very  little  improvement  in  performance  was  noted  in 
the  last  three  hours;  in  fact,  most  subjects  achieved  a  50  per- 
cent correct  level  of  accuracy  within  the  first  hour.   The  final 
tests  provided  the  data  for  overall  analysis  and  consisted  of  six 
sequences  of  40  symbols.   Two  sequences  were  played  at  15  symbols 
per  minute,  a  further  two  sequences  at  20  symbols  per  minute  and 
the  remainder  at  25  symbols  per  minute.   With  the  exception  of  one 
experiment,  all  the  subjects  trained  by  this  method  were  provided 
with  individual  tape  recorders  in  a  language  laboratory.   They 
could  therefore  test  themselves  on  the  material  provided  and  when 
necessary  rewind  the  tape  back  to  the  introductory  sections  and 
investigate  the  source  of  their  errors.   The  responses  to  the  fin- 
al tests  were  written  down  and  instructions  were  given  not  to 
stop  the  tape  recorders  during  this  period. 

THE  EXPERIMENTS  AND  THEIR  ANALYSIS 

The  experiments  described  below  were  each  performed  with  the  ob- 
ject of  gathering  evidence  bearing  upon  what  advantages  if  any 
accrue  from  the  use  of  data  compression  or  a  speech-like  output 
in  a  reading  machine.   Even  if  the  answers  promised  to  be  clear 
cut  at  the  start  of  an  experiment,  at  the  end  it  was  sometimes 
found  that  the  results  could  not  be  interpreted  in  quite  the  way 
expected.   An  example  of  just  such  an  experiment  was  the  attempt 
to  measure  the  effects  of  removing  some  sequential  redundancy  from 
the  optophone,  while  still  maintaining  the  same  dimensionality  of 
the  display.   Two  sound  displays  were  therefore  generated  using 
the  units  linked  by  Lines  E  and  F  in  Figure  9. 

The  system  connected  by  Line  E  produced  a  "conventional  opto- 
phone" output  similar  to  the  Battelle  reader.   The  connections 
shown  by  Line  F  produced  a  system  called  the  "compressed  optophone" 
which  operated  upon  the  degraded  6-bit  scan  and  identified  Ascen- 
ders, Descenders,  Ribbon  Height  Verticals,  etc.,  and  switched  on 
oscillators  associated  with  a  particular  feature  as  follows.

TABLE  3

1000  cps    Ascenders
768  cps     Curvature  upward
640  cps     Horizontal  lines
576  cps     Ribbon  height  verticals
512  cps     Curvature  downward
384  cps     Descenders

Figure  11  shows  the  results  of  this  processing  procedure 
functioning  with  the  lower  case  letter  r.   Thus  the  "compressed 
optophone"  display  is  produced  by  the  feature  detection  program 
already  described  which  controls  the  different  frequency  channels 
through  a  paper  tape  reader.   There  is  no  delay  or  integration 
process  involved  and,  therefore,  the  serial  form  of  the  informa- 
tion emerging  from  the  computer  is  preserved  in  the  output.   Eight 
lower  case  letters  were  chosen  as  the  source  material;  /,  i,  k^ 
3 »    Pi    q y    ^»    s. 

Using  the  NPL  Method,  two  teams  of  5  paid  subjects,  aged  be- 
tween 18  and  20  years,  were  engaged  and  trained  for  a  period  of 
3-1/4  hours  on  one  of  the  two  groups  of  sounds.   When  the  tests 
were  completed  the  two  teams  were  exchanged  and  the  training  and
testing  procedure  repeated. 

Confusion  matrices  were  constructed  from  the  final  test  data 
for  the  pooled  results  of  the  two  teams  and  a  multivariate  analy- 
sis carried  out  (see  Appendix  D).   The  results  of  the  analysis
and  a  statistical  test  of  significance  are  shown  in  Tables  4  and  5.


TABLE  4

                Information          Max.  Possible
Output          Transmitted          Information  Transfer

Conventional    1.92  bits/symbol    2.99  bits/symbol
Compressed      2.46  bits/symbol    2.99  bits/symbol


Table  5  shows  that  the  performance  of  the  compressed  opto- 
phone system  was  significantly  superior  to  the  conventional  opto- 
phone.  This  occurred  despite  the  fact  that  the  variance  in  the
presentation  length  among  the  letters  was  larger  in  the  case  of
the  conventional  optophone  and  therefore  provided  additional  cues
for  identification.

TABLE  5

                Mean  of  No.
Output          Correct  (Max.  40)    Variance    "t"     Significance

Conventional    30.70                9.12
Compressed      36.47                5.00        3.03    P  =  0.015

The  pooled  results  of  10  subjects,  df  =  9

The results of the experiment appear to support the view
that reducing the redundancy present in the display can produce a
significant improvement in response accuracy. However, this
depends upon a number of assumptions. Taking, as an example, the
general case of a transmitter, there are altogether three
variables involved: the dimensionality of the output, the rate at
which information is being transmitted, and the amount of
redundancy present. All these variables are interdependent, for,
if the dimensionality is held constant and redundancy is reduced,
the information transmission rate must automatically rise. Thus,
the deliberate alteration of any one variable results in the
actual alteration of at least two. The observations of this
experiment are made on the transmitter and receiver as a single
entity, and the interpretation of these results in terms of the
response of the receiver to one particular altered parameter must
assume that the other factors which are also varied have no
effect. In fact, the subjects' introspective observations showed
this assumption to be invalid. Thus changes in all three
variables discussed above could have been perceived, and it is not
possible to draw the unequivocal conclusion that the experiment
demonstrates the need for the removal of some of the geometric
redundancy of print. Without much more knowledge about precisely
what the perceived dimensions in an acoustic signal are, there is
little prospect of avoiding this dilemma.

The first experiment using the speech-like output, produced
by the components connected by Line B (Figure 9), explored the
differences in recognition accuracy between letter symbols
produced by PAT and those from the conventional optophone. Further,
the experiment sought to discover whether there was any
correlation between visual discriminability at the parameter
waveform level and auditory discrimination tests carried out on the
sounds produced. To this end the parameter waveforms representing
each of the 26 lower case letters were drawn on separate cards and
a number of subjects were asked "to classify them into groups
having broadly similar characteristics." All subjects chose the
most obvious step, namely to divide the cards in such a way that
each group contained non-zero values in common parametric positions
(see Figure 12). The subjects were then asked to take each group
in turn and order the cards, placing the card with the most
striking features on the left and ranging the remainder linearly so
that the differences between adjacent cards were minimal. The
results were remarkably consistent. The most frequently adopted
ordering is shown in Figure 12, and can be identified by reference
to Table 6.

TABLE 6

j  m  n  r  u  i  k  d  b  h  l  p  q  w  z  s  e  g  x  c  v  y  o  a  f  t

On the basis of this ordering two groups of eight letters
were selected. The first (Group 1) consisted of the eight
adjacent characters s, e, g, x, c, v, y, o from the bottom row,
which were termed the similar letters. The second (Group 2) was
composed of eight characters selected from the extremes of each of
the principal groupings, i.e., j, m, i, k, l, p, u, t.

Recordings of the speech sounds were made on this occasion
at the Phonetics Department of Edinburgh University, and the
connections to the control parameters differed slightly from those
given on pages 25 through 27. The six control waveforms were
connected to the synthesizer as follows:


1.  Ascenders                  -  Hiss 1 S
2.  Descenders                 -  Hiss 2 s
3.  Ribbon Height Verticals    -  Formant 3
4.  Curvature Downward         -  Formant 2
5.  Curvature Upward           -  Formant 1
6.  Horizontal Continuity      -  Larynx Frequency
7.  (Rectangular waveform)     -  Larynx Amplitude

The seventh control was a rectangular waveform switching
on when parameter six exceeded zero volts and off when the
parameter fell once again to zero. Two recordings were made with
these sounds, which taught an association between them and the
letters A through H or the numbers 1 through 8. A third group of
lower case letters (b, f, g, q, u, v, w, z), which had been
selected at random, was recorded with the conventional optophone
system, and a training tape was constructed which related these
sounds (Group 3) to the letters S through Z.
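
The behavior of the seventh control described above can be pictured
as a simple threshold gate on the sixth control waveform. The sketch
below is illustrative only: the function name and the sampled voltage
values are hypothetical, and the original apparatus was of course an
analog circuit rather than a program.

```python
def larynx_gate(horizontal_continuity):
    """Return 1 while control six is above zero volts, otherwise 0.

    The list holds successive samples of the sixth control waveform in
    volts; the sampled representation is an assumption made for
    illustration.
    """
    return [1 if volts > 0.0 else 0 for volts in horizontal_continuity]


# Hypothetical samples of control six (Horizontal Continuity), in volts.
samples = [0.0, 0.4, 1.2, 0.7, 0.0, 0.0, 0.3, 0.0]
print(larynx_gate(samples))
```

The gate is on only for samples strictly greater than zero, matching
the statement that the control switches off when the parameter falls
once again to zero.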

Twelve paid subjects were divided into six teams and were
trained by the NPL Method on all three groups of sounds. Order
effects were balanced by the use of a 3 by 3 latin square
procedure. When all the data had been collected the confusion
matrices for the groups were constructed and the means and
variances calculated for the pooled results from the final tests.
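
A 3 by 3 latin square assigns the three groups of sounds to three
positions in the training sequence so that each group appears once in
every position. The report does not say how the six teams were
allotted to orders; the sketch below assumes, purely for illustration,
a cyclic square together with its mirror image, which between them
give six distinct orders. All names are hypothetical.

```python
def cyclic_latin_square(conditions):
    """Build a latin square by rotating the list of conditions one step per row."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]


def is_latin_square(square):
    """Check that every row and every column contains each symbol exactly once."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({square[r][c] for r in range(n)} == symbols for c in range(n))
    return len(symbols) == n and rows_ok and cols_ok


groups = ["Group 1", "Group 2", "Group 3"]
square = cyclic_latin_square(groups)
mirror = [list(reversed(row)) for row in square]  # reversed orders (assumption)
orders = square + mirror                          # one order per team (assumption)

for team, order in enumerate(orders, start=1):
    print(f"Team {team}: {' -> '.join(order)}")
```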

The results of a multivariate informational analysis of the
confusion matrices are shown in Table 7.


TABLE 7

          Information          Max. Possible
Group     Transmitted          Information Transfer

1         2.61 bits/symbol     2.99 bits/symbol
2         2.57 bits/symbol     2.99 bits/symbol
3         1.83 bits/symbol     2.99 bits/symbol
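
The exact computation behind the informational analysis is not given
in the report. A common way of obtaining "information transmitted"
from a confusion matrix is the mutual information between stimulus and
response, estimated from the relative frequencies in the matrix. The
sketch below assumes that definition; the function name and the small
test matrices are hypothetical. For eight equally likely symbols the
ceiling of this measure is log2(8) = 3 bits, close to the 2.99
bits/symbol quoted as the maximum.

```python
import math


def information_transmitted(confusion):
    """Estimate mutual information (bits) from a stimulus-by-response count matrix.

    Rows are stimuli, columns are responses, and entries are counts.
    """
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(col) for col in zip(*confusion)]
    bits = 0.0
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            if count > 0:
                # p(i,j) * log2( p(i,j) / (p(i) * p(j)) )
                ratio = count * total / (row_sums[i] * col_sums[j])
                bits += (count / total) * math.log2(ratio)
    return bits


# Hypothetical matrices: perfect identification and pure guessing, 8 symbols.
perfect = [[5 if i == j else 0 for j in range(8)] for i in range(8)]
guessing = [[1] * 8 for _ in range(8)]
print(information_transmitted(perfect))
print(information_transmitted(guessing))
```

Perfect identification reaches the ceiling and pure guessing transmits
nothing; the figures in the tables lie between these two extremes.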

The  information  transfer  achieved  with  the  two  speech-like 
outputs  is  clearly  shown  to  be  superior  to  that  achieved  with  the 
optophone  system.   The  statistical  significance  of  this  result 
is  illustrated  by  Table  8. 

TABLE 8

         Mean of No. Correct
Group    (Max. 40)              Variance    "t"                      Significance

1        37.6                   11.3        between 1 & 2 = 1.76     P = 0.1
2        36.3                   18.5        between 2 & 3 = 4.58     P   0.001
3        28.6                   42.8        between 1 & 3 = 5.63     P   0.001

The pooled results of 12 subjects, df = 11

The other fact that emerges is that the performance on the
dissimilar and similar selections of control waveforms is not
significantly different. However, a closer examination of the
confusion matrices, shown in Appendix D, reveals that 57 percent of
the total errors occurring for the dissimilar Group 2 sounds are
contributed by a confusion between only one pair of symbols, A and
E, alias j and l. If the two highest sources of error for the
sounds of Groups 1 and 2 are neglected (13 percent in the case of
Group 1), the dissimilar sounds are found to be superior in
performance to the similar sounds. The confusion arises because
the power of the noise component, indicating the presence of an
ascender for l, is very low. Furthermore, the variation in the
Formant 2 parameter for l, indicating curvature, is also small
and, although differences between the sounds are obvious if they
are rapidly alternated, absolute discrimination of the differences
is extremely difficult.

To study the effects of random noise and a restricted
bandwidth on the discrimination accuracy, the final test material
was readministered to the subjects through poor reproducing
equipment having a pass band extending from 100 cps to 1.8 kc/sec
and falling off at 12 db per decade at higher frequencies. The
noise level was in the region of 36 db with respect to the peak
signal intensity. Table 9 shows the scores obtained on this
occasion expressed as a percentage of the score obtained in the
earlier test.

TABLE 9

          Score - Second Attempt
Group     (Percent)

1         92
2         90
3         64


These results show that the speech-like output has two
principal advantages: first, it can produce sounds which are more
easily discriminated than those from the optophone; second, the
speech output is far less vulnerable to the effects of poor
reproduction equipment and noisy environment. The question arises
whether these properties are unique to speech or whether they are
shared by all multidimensional signals. In the following
experiment the rate of learning for multidimensional audio signals
was compared with speech-like sounds and the results of some
earlier work.

In "Research on Reading Machines for the Blind" (5), the
final report published by F. S. Cooper and P. A. Zahl in 1947, an
experiment was described which furnished learning curves for a
number of projected reading machine outputs. The principal point
that emerged from the Haskins experiment was that a speech-like
code derived from spoken phonemes produced a performance superior
to a wide range of direct coding reading aids. A basic
vocabulary of eight words, identical with that used by Cooper, was
used here as the generating source for three audio outputs. These
were sounds produced by the systems connected by Lines C and B
in Figure 9 and termed the MDO and PAT sounds, and in addition
a natural speech code (called Wuhzi) generated by a human speaker
according to the rules described in the Haskins report. Thus the
speech code was derived from the word list by substituting vowels
by other vowels and stop consonants by other stop consonants,
etc. The vocabulary and code words are listed in Table 10.


TABLE 10

Vocabulary    Code Word

With          Yekw
Will          Yemm
Were          Yini
From          Snal
Been          Jiir
Have          Wozi
This          Kwef
That          Kwok

The waveforms produced by the feature analysis program are shown
in Figure 13.

Sixteen volunteer subjects, drawn from the staff of
Autonomics Division, were divided into four equal teams who worked
with all three groups of word-sounds in different orders. The
training technique followed the "Haskins Method" and the learning
curves that were produced are shown in Figure 14. The curve for
the code Wuhzi coincided exactly with that traced by the Haskins
research workers and it was therefore reasonable to assume that the
conditions of their experiment had been accurately replicated. For
this reason the results obtained at Haskins Laboratories with the
optophone and a number of other systems have been plotted on the
same graph for comparison purposes.

In Table 11 the results are shown of a chi-squared test made
between the "observed" results obtained with the PAT and MDO
displays and the expected performance, which for the present
purpose has been taken to be that of the optophone. The MDO curve
is found to differ significantly, at rather better than the 1
percent level, while the difference between the PAT curve and the
optophone could easily arise by chance.

TABLE 11

Output    Chi-Squared    Significance

MDO       20.19          P = 0.007
PAT       3.74           P = 0.81

A chi-squared test on data from Figure 14, df = 7
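
For reference, the chi-squared statistic compares a set of observed
values with the corresponding expected values, here the optophone
performance. The report does not list the points read from Figure 14,
so the sketch below shows only the arithmetic; the function name and
the two-point input are hypothetical and are not data from the
experiment.

```python
def chi_squared(observed, expected):
    """Sum of (observed - expected)^2 / expected over paired values."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Toy values chosen only to make the arithmetic easy to follow:
# (12 - 10)^2 / 10 + (18 - 20)^2 / 20 = 0.4 + 0.2
print(chi_squared([12, 18], [10, 20]))
```

The resulting statistic is then referred to the chi-squared
distribution with the stated number of degrees of freedom to obtain
the significance level.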

There are two observations drawn from Figure 14 that are worth
noting. First, the learning curve for the PAT output rises more
slowly than the MDO curve and, second, extrapolation beyond the
eighth series predicts that the curves will cross over and the PAT
output will give an ultimately superior performance. A possible
explanation of the crossover may lie in the fact that the sounds
differ in their basic structure. In particular, the PAT sounds
are smoothly varying, whereas the MDO sounds contain a number of
transients, typically when the frequency switches from 480 cps to
2000 cps. Thus the MDO sounds have a marked rhythm which may alone
carry sufficient information to specify eight alternatives. One
can proceed to speculate that there are two mechanisms involved.
The first controls the discrimination of stress and timing and
functions quickly; thus, the first features that are memorized are
the transients. With further exposure to the sounds a second
mechanism may come into play and enhance the discrimination
process; for example, if the mechanisms of the feedback theory do
exist and the sound can be mimicked, it might take time for a
reference link to be established between the acoustic pattern and
the listener's own verbal representation of the sound. Equally, as
shown below, other mechanisms can be postulated that would achieve
the same result.

A possible explanation of the slower learning performance
with the PAT sounds might be that arising from the observations
of House, et al. (10). These workers found that for a series of
sounds, ranging from unidimensional to multidimensional and
finally to speech stimuli, the performance during learning improved
as the number of dimensions increased but, as the machine
generated stimuli became more like speech, performance
deteriorated. However, for natural speech signals, performance was
at once superior to all other stimuli. The reason suggested by
House, et al. was that the synthesized and natural speech sounds
are interpreted as linguistic events and are discriminated by an
entirely different mechanism from the multidimensional stimuli.
All the speech sounds are therefore presented within a linguistic
frame and are categorized accordingly. It is because the
synthesized speech fails to fit this framework precisely that the
performance is consequently poorer.

To investigate whether a crossing of the PAT and MDO curves
would actually occur if the experiment were extended beyond the
eighth series, a longer version of this experiment was performed
on a second group of subjects. In addition, to obtain some
indication of whether the phenomena reported by House, et al. were
active in depressing performance with synthesized speech sounds,
the Wuhzi vocabulary was replaced by the same words synthesized
on the PAT apparatus. The quality of articulation of these words
was very much poorer than that of the words produced by the human
speaker in the previous experiment, and it provided an opportunity
to compare the performance of synthesized and actual speech under
conditions which by no means favored the synthetic speech sounds.
Included with the Wuhzi, PAT, and MDO sounds was a fourth set of
word sounds generated by a variant of the optophone in which the
output intensities of the six tuned oscillators could be
controlled by the same continuous functions that controlled the
PAT and MDO output devices. This output is termed the variable
volume optophone or VVO.

Twenty-four paid subjects, aged between 16 and 25 years,
were engaged and given the task of learning all four sets of word
sounds. The subjects were grouped in pairs and order effects
reduced by the use of an orthogonal 4 by 4 latin square design.
Two subjects resigned before the experiment was completed and
the results of the remaining 22 were pooled to give the data
plotted in Figure 15. The performance of the synthesized Wuhzi,
over the first five series, is identical with that found earlier
with the naturally produced word sounds. Beyond the fifth series
the curve does not continue to rise as steeply and the perfect
score level is achieved with consistency by about 18 of the 22
subjects. The results of chi-squared tests of the differences
in performance between the VVO sounds and the MDO, PAT, and Wuhzi
outputs in turn are shown in Table 12. These provide only a
rough test of significance because the method of calculation does
not eliminate subject variances.

TABLE 12

Output    Chi-Squared         Significance

MDO       7.76                P = 0.93
PAT       8.33                P = 0.91
WUHZI     24.45 (df = 14)     P = 0.03

A chi-squared test on data from Figure 14, df = 15

A comparison of the results of the two experiments suggests
that the differences between the performances are more reliable
than the chi-squared test indicates. For example, the slower
initial learning rate for the PAT sounds compared with the MDO
output is repeated in the second experiment and the cross-over
predicted by the earlier data does indeed occur. It is also
interesting to note that both the MDO and PAT outputs permit
better performances than the VVO system despite the fact that all
three systems are controlled by the same waveforms and therefore
receive the same input information. One other incidental
observation is that some of the information contained in the VVO is
more clearly represented than in the PAT output. This is
particularly true for the waveform which controls the third formant
of the speech-like display. Variation of formant 3 makes a
barely discriminable difference to the output even in a relative
judgment situation. Thus, despite the loss of this source of
information the PAT sounds are still superior to the VVO output.
The choice of the control parameter allocation for the PAT output
used in these experiments was largely ad hoc and it is clear
that a significant improvement would result from a reallocation
of the "Ribbon Height Vertical" control to some other more
discriminable feature of the speech display.

It is possible to draw three conclusions from these results.
First, the data are consistent with the view that the
introduction of transient or consonant-like sounds can increase the
rate of learning, particularly in the initial stages. Second,
in agreement with House, et al., the multidimensional outputs
can increase the rate of learning. Finally, there is no
evidence that synthesized speech sounds lead to a measurably
slower rate of learning than can be obtained by the same methods
from natural speech.

Apart from the differences in learning time, the Haskins
and NPL training methods differ in the manner and speed of the
feedback. With the Haskins method, feedback is received in a
matter of seconds but, in general, with the NPL method the
responses are corrected minutes later when a practice test of 20
sounds has been completed. Studies of the effects of delay of
information upon learning have shown that quite dramatic changes
in performance can occur and, to investigate whether the results
of the previous experiment owed anything to the training methods,
a further experiment using the NPL procedure was mounted with
the MDO, PAT, and VVO outputs.

Twelve paid subjects of the 16-25 age group were used on
this occasion and the usual precautions taken to avoid order
effects. A total of 3-1/4 hours was spent on learning each group
of sounds and the results of the final tests for each output were
pooled to form confusion matrices. The results of an
informational analysis are shown in Table 13.

TABLE 13

          Information          Maximum Possible
Output    Transmitted          Information Transfer

MDO       2.24 bits/symbol     2.99 bits/symbol
PAT       2.21 bits/symbol     2.99 bits/symbol
VVO       1.55 bits/symbol     2.99 bits/symbol

A  statistical  analysis  of  the  significance  of  the  differences 
between  the  three  outputs  is  shown  in  Table  14. 

TABLE 14

           Mean of No. Correct
Output     (Max. = 40)            Variance    "t"                      Significance

1. MDO     33.8                   25.7        between 1 & 2 = 0.28     P = 0.8
2. PAT     34.2                   31.02       between 2 & 3 = 4.36     P = 0.001
3. VVO     27.3                   53.58       between 1 & 3 = 4.38     P = 0.001

The pooled results of 12 subjects, df = 11

From Table 14 the average number of correct responses with
the PAT display is seen to be slightly better than that from the
MDO output, but Table 13 indicates that the amount of information
transmitted by the PAT system is nevertheless less than that
transmitted by the MDO. This apparent disagreement arises from the
fact that figures for both the variance and the number of correct
results are employed in the calculation of information
transmitted, and the variance of responses to the PAT output is
higher than to the MDO output. The differences between the PAT and
MDO outputs and the VVO sounds are highly significant. Although the
PAT performance is still slightly better than the MDO, suggesting
that the different training technique does not affect the
performance, this difference is not significant. An analysis of the
principal sources of variance is shown in Table 15. "Codes"
refers to PAT, MDO, and VVO outputs.

The  results  of  a  Z  test  on  the  data  of  Table  15  are  shown 
in  Table  16. 

TABLE 15

                         Sum of Squares    Degrees of    Mean Square
                         of Deviation      Freedom       Deviation

Codes                    2160.1            2             1080.5
Order                    1844.9            5             368.98
Tests                    239.7             11            21.78
Codes x Order            830.8             10            83.08
Codes x Tests            706.8             22            42.13
Order x Tests            3051.0            55            55.47
Codes x Order x Tests    1371.0            110           12.46

TABLE 16

                  Degrees of Freedom    Probability of Variance
          Z       n1        n2          Arising from Random Sampling

Codes     2.23    2         110         P << 1%
Order     1.69    5         110         P < 1%
Tests     0.28    11        110         5% < P < 10%

The analysis shows, as indicated by the previous experiment,
that the differences in performance with the codes are highly
significant. Both order and subject variances are contained under
the entry labeled order, but, because order effects have been made
small by the latin square arrangement, subject variations
represent the largest contribution. The Z test shows that these
variations are significant at little better than the 1 percent
level; also the variation in speed of presentation of the final
tests had no significant influence on the performance.
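
The Z values of Table 16 are consistent with Fisher's z, that is, half
the natural logarithm of the ratio of an effect mean square to the
residual mean square. The report does not state the formula; the
sketch below assumes it, and takes the Codes x Order x Tests entry of
Table 15 (110 degrees of freedom, matching n2 in Table 16) as the
residual. The names are hypothetical; the numbers are those printed in
Table 15.

```python
import math

# Mean square deviations as printed in Table 15.
MEAN_SQUARES = {"Codes": 1080.5, "Order": 368.98, "Tests": 21.78}
RESIDUAL_MEAN_SQUARE = 12.46  # Codes x Order x Tests, 110 degrees of freedom


def fisher_z(effect_mean_square, residual_mean_square):
    """Fisher's z: half the natural log of the variance ratio."""
    return 0.5 * math.log(effect_mean_square / residual_mean_square)


for source, mean_square in MEAN_SQUARES.items():
    z = fisher_z(mean_square, RESIDUAL_MEAN_SQUARE)
    print(f"{source}: z = {z:.2f}")
```

Under these assumptions the computed values round to 2.23, 1.69, and
0.28, in agreement with the Z column of Table 16.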

On page 9 an alternative to the feedback theory was presented
which attempted to explain some features of the responses to vowel
and consonant stimuli. To test this theory an experiment was
designed which endeavored to find out whether the same kind of
discrimination behavior could be demonstrated with nonspeech sounds
mapping similar trajectories in the frequency-time plane.

A frequency modulated oscillator (Appendix C) was used to
generate two series of thirteen stimuli, each lasting for 400 msec.
The first series was composed of continuous tones spaced at equal
logarithmic intervals in the range from 300 cps to 1200 cps. The
second series of sounds (the transients) traced a path in the
frequency-time plane similar to that followed by the second
formant of the /b/, /d/, and /g/ consonants shown in Figure 1, but
were sited centrally in the 300 cps to 1200 cps frequency band.
Thus, the transient sounds consisted of a changing frequency, 100
msec in length, emanating from one of thirteen equally spaced
points along the frequency band and terminating in a continuous
tone at 750 cps which lasted for the remaining 300 msec.
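
The two stimulus series described above can be tabulated numerically.
The sketch below rests on two assumptions not stated in the report:
that the thirteen transient starting points were spaced linearly (which
places the middle one at 750 cps, the terminating frequency), and that
the 100 msec glide was linear in frequency. All function names are
hypothetical.

```python
def tone_frequencies(low=300.0, high=1200.0, count=13):
    """Tones at equal logarithmic intervals between low and high (cps)."""
    ratio = (high / low) ** (1.0 / (count - 1))
    return [low * ratio ** k for k in range(count)]


def transient_start_frequencies(low=300.0, high=1200.0, count=13):
    """Starting points at equal linear intervals (assumed spacing)."""
    step = (high - low) / (count - 1)
    return [low + step * k for k in range(count)]


def transient_frequency(start_cps, t_msec, glide_msec=100.0, final_cps=750.0):
    """Instantaneous frequency of a transient, assuming a linear glide."""
    if t_msec >= glide_msec:
        return final_cps
    return start_cps + (final_cps - start_cps) * t_msec / glide_msec


tones = tone_frequencies()
starts = transient_start_frequencies()
print([round(f) for f in tones])
print(starts)
print(transient_frequency(starts[0], 50.0))
```

With these assumptions adjacent tones stand in a fixed frequency ratio
of 4^(1/12), while adjacent transient starting points differ by a
fixed 75 cps.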

Twelve  volunteer  subjects  from  the  laboratory  were  recruited 
and  given  the  task  of,  first,  making  relative  discriminations  of 
sounds  presented  in  an  ABX  mode  and,  second,  of  classifying  the 
sounds  into  three  groups  called  Low,  Medium,  and  High. 

An ABX presentation consisted of any two different stimuli
A and B selected from the series of tones or transients, followed
by the sound X whose true identity was either A or B. Thus,
the subject's task was to identify sound X with either A or B
and write down the appropriate letter. Three sessions, each
consisting of ten series of 20 trials, were administered to the
subjects. Session one tested the discrimination of adjacent sounds
from the range and of sounds separated by two stimuli, all in
random order. Session two presented sounds separated by two,
three, and four stimuli, and session three included sounds
separated by four, five, and six stimuli. Average numbers of
correct discriminations for each session, for both Tone and
Transient stimuli, are shown in Table 17.


TABLE 17

Stimuli       Session 1    Session 2    Session 3
              %            %            %

Tones         97           99           100
Transients    67           82           91

As  the  stimuli  A  and  B  are  drawn  from  more  widely  separated 
points  in  the  range  so  discrimination  becomes  more  accurate. 
Table  17  also  shows  that  relative  discriminations  can  be  carried 
out  more  easily  with  tonal  than  with  transient  stimuli. 
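
The construction of an ABX trial and the scoring of a session can be
sketched as follows. The report gives no details of how the trials
were drawn; the function names, the use of stimulus indices 0 to 12,
and the treatment of "separation" as a difference of index are all
assumptions made for illustration. Since X is A or B with equal
probability, a listener who guesses would be expected to score about
50 percent.

```python
import random


def abx_trial(num_stimuli, separation, rng):
    """Pick A and B `separation` steps apart and set X to one of them at random."""
    low = rng.randrange(num_stimuli - separation)
    a, b = low, low + separation
    if rng.random() < 0.5:
        a, b = b, a  # randomize which of the pair is heard first
    answer = rng.choice("AB")
    x = a if answer == "A" else b
    return {"A": a, "B": b, "X": x, "answer": answer}


def percent_correct(trials, responses):
    """Percentage of trials on which the written letter matches X's identity."""
    hits = sum(1 for trial, resp in zip(trials, responses) if resp == trial["answer"])
    return 100.0 * hits / len(trials)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
series = [abx_trial(13, 2, rng) for _ in range(20)]
print(series[0])
```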

In the classification experiment the tones were categorized
according to three pitch levels while the transients were
categorized according to whether the transitions originated from
Low, Medium, or High points in the frequency band. The subjects
were instructed to call the first 4 sounds Low, the 5 sounds in the
middle of the range Medium, and the remaining 4 sounds High. The
preliminary instruction was concluded by playing the entire range
of sounds through once and following this by the three groups of
sounds prefaced by the category names. The test recordings
consisted of 10 groups of 20 sounds, each separated by a three
second interval. Having received their instructions the subjects
listened to the two test recordings and wrote down their decisions
on a response sheet.

On the basis of their skill at categorizing both kinds of
stimuli six subjects were selected and retested under better
acoustic conditions in a sound-proof room. Twelve groups of 20
sounds were administered. The first two groups were provided for
training purposes and lists of correct responses were supplied.
The construction of both Tone and Transient test recordings was
the same.

None of the subjects produced results which entirely
contradicted the predictions made above, but in only two cases
were the results unmistakably favorable. A composite graph, shown
in Figure 16, in which results were not averaged directly, was
constructed because the subjects did not place their boundaries at
the same point along the stimulus range. Thus Figure 16 shows the
average gradient at the boundaries after alignment. The average
width of the medium category has been drawn to approximately the
correct proportion and shows that at the 50 percent level it was
5.4 stimuli for transients and 6.6 stimuli for tones, compared
with the 5 stimuli demanded in the initial instructions. The
more accurate positioning of the transient category boundaries
was reflected in all the subjects' results.

If the boundaries of Figure 16 are inspected between the 20
percent and 80 percent levels the differences between the
gradients indicate that the Transients were classified more
precisely than the Tones. However, these differences are not as
marked as those demonstrated for the vowels and stop consonants in
Figure 17. The reason for this may lie in differences in the
length of training or possibly it may arise from an observation
made by nearly all the subjects. This was that the Transient
stimuli were by far the more difficult to identify because it was
very easy, during a moment's inattention, to miss the initial
transient. The Tones on the other hand allowed considerably more
time for discrimination. Comparison of the conditions for
Transient detection with those for stop consonants (described by
Eimas [6]) shows that the listener to the speech sounds could be
utilizing the short voice bar, shown immediately preceding the
second formant transition in Figure 17, as a cue to the time of
onset of the transition. Thus his attention could be arrested at
the critical time.

The results of this preliminary experiment do not allow a
rejection of the feedback hypothesis. There are still a number
of unanswered questions that merit further investigation but, on
the basis of the present evidence, the balance of probability must
now weigh more heavily against the feedback interpretation.

CONCLUDING  REMARKS 

Taking the three experiments together the overall conclusion is
that the MDO and PAT outputs give performances which are
significantly better than the optophone or VVO signals. Just how
much this improvement represents in increased reading speed is an
important question but difficult to answer without introducing
some gross assumptions. One vital piece of information which is
lacking is the speed of response to each of the displays. As
the Z test shows, the subjects were not pressed for time in making
their responses. Thus, any estimates of possible reading speeds
made at this stage will be insubstantial. In this context it is
therefore worth warning against the conclusion that an output
whose performance lies midway between the optophone and natural
speech will necessarily give a reading speed lying midway
between 20 and 200 wpm. The reason for this may be obvious; it is
that the sample of stimulus material used in the experiments has
been extremely small. If the number of stimuli were raised from
8 to 80 or 800, the relative performance from the MDO and PAT
systems might slump dramatically. There is no way of being
certain of this except to perform the experiment. At this point
the scale of the undertaking rapidly becomes too large to be
conducted in a normal laboratory and takes on the status of a field
trial with a likely increase in the amount of work and a
reduction in the rate at which information is gained. The
alternative is to allow the direction of research to be guided by
intuition and the introspections of unbiased subjects.

Many  subjects  have  pointed  out,  quite  correctly,  that  the 
sounds  produced  by  the  PAT  device  represent  only  a  small  selection
of  the  total  repertoire  found  in  human  speech.   In  terms  of
some  n-dimensional  perceived  signal  space  (PSS)  the  PAT  sounds 
are  not  widely  separated  and  occupy  a  relatively  small  volume  of 
the  space.   The  subjects  usually  refer  to  the  fact  that  the  out- 
put fails  to  include  certain  vowels  and  consonants.   Despite  fur- 
ther developments  of  the  PAT  system  leading  to  the  generation 
of  a  more  multivariate  output  it  is  possible  that  introspective 
observations  may  still  yield  responses  stated  in  the  same  terms. 
This  may  arise  because  the  rapid  transitional  consonant-like 
sounds  that  the  device  will  produce  will  initially  compare  un- 
favorably with  familiar  English  consonants.   Subsequent  training 
would  be  expected  to  make  the  consonant-like  and  vowel-like  sounds 
equally  discriminable. 

However,  adults  have  only  a  limited  capacity  for  learning
completely  new  discriminations.   This  capacity  depends  in  a  complicated
way  upon  various  incentives  and  the  learning  time  required
to  achieve  a  given  performance  level.   It  may  prove  necessary
to  reduce  the  learning  time  by  making  the  machine  generate
a  larger  number  of  identifiable  phonemes.   The  task  of  learning 
this  output  would  then  become  no  more  difficult  than  that  of 
learning  a  natural  language.   If  the  production  of  identifiable 
phonemes  should  prove  necessary  this  will  impose  quite  severe 
constraints  on  the  control  functions  and  it  will  not  be  possible 
to  generate  these  waveforms  from  print  without  a  considerable  in- 
crease in  processing  complexity.   From  this  point  there  is  but  a 
relatively  small  step  to  the  machine  providing  spoken  English. 
However,  at  the  present  time  there  are  no  firm  indications  that
these  problems  will  necessarily  arise. 

It  is  the  writer's  view  that  there  is  no  evidence  which  can 
contradict  the  conclusion  that  the  principal  factor  controlling 
the  efficiency  of  an  audible  display  is  its  dimensionality.   It 
also  appears  likely  that  there  is  no  intrinsic  reason  why,  given 
the  same  amount  of  training,  the  same  ultimate  reading  speed
could  not  be  achieved  with  either  speech-like  (mimicable)  or
non-speech-like  multidimensional  outputs,  provided  that  they  do  not
violate  certain  physical  resolution  limits  of  the  ear.   However, 
the  results  of  these  experiments  indicate  that,  if  the  output 
is  generated  by  a  speech  synthesizing  machine,  we  can  exploit 
some  of  the  discrimination  skills  already  acquired  in  natural 
speech  communication  and  achieve  better  performances.   On  the  other
hand,  there  is  some  evidence  that  the  PAT  can  emit  sounds  which  contradict
certain  speech  habits,  which  then  have  to  be  unlearned.   A
good  example  of  this  is  the  case  when  a  sharp  pulse  of  noise,  in- 
dicating an  ascender,  occurs  in  the  middle  of  a  long  voiced  sound. 
The  impression  given  to  the  listener  is  that  the  sources  of  the 
voiced  sound  and  noise  pulse  are  quite  separate  and  he  is  inclined 
to  disregard  the  noise  as  interference.   Thus  the  reader  has  to  be 
prepared  for  combinations  of  sounds  which  do  not  occur  in  natural 
speech.   It  is  to  be  hoped  that  effects  of  this  sort  will  not 
prove  to  be  insurmountable  because  it  is  highly  probable  that  the 
avoidance  of  these  situations,  by  generating  sounds  which  make  even
greater  use  of  acquired  speech  discrimination  skills,  can  only  be
attained  at  the  cost  of  a  more  complicated  processing  logic.   Such
logic  will  be  necessary  to  limit  the  range  of  controls  for  the  synthesizer
to  the  form  required  to  produce  quasi-phonemes  or  actual  phonemes.

The  scale  of  this  research  effort  is  modest  and  there  are 
insufficient  resources  and  time  to  investigate  in  depth
the  characteristics  of  a  wider  variety  of  multidimensional
outputs.   These  experiments  have  shown  the  advantages  that  famil- 
iarity confers  on  speech-like  outputs  compared  with  nonspeech  dis- 
plays and,  despite  the  possible  difficulties,  the  prospects  of 
success  seem  to  favor  speech-like  outputs.   The  present  stage  of 
development  of  the  PAT  system  is  rudimentary,  and  a  number
of  modifications  could  be  made  which  would  lead  to  a
more  varied  repertoire  of  sounds  and  (probably)  better  performances
in  learning  trials  similar  to  those  described  here.   With
all  reading  machines,  at  some  point  the  decision  has  to  be  made 
whether  further  development  based  upon  introspective  observations 
should  be  halted  in  favor  of  carrying  out  more  realistic  trials 
with  larger  vocabularies.   While  it  is  too  early  yet  to  proceed 
to  this  stage  with  the  PAT  system,  large  scale  evaluation  of  out- 
put along  the  lines  adopted  with  the  Battelle  reader  should  not 
be  delayed  for  too  long,  for  it  will  provide  the  only  convincing 
measure  of  feasibility. 


APPENDIX  A: 

The  basic  features  of  Fournier  d'Albe's  optophone  are  illustrated 
in  Figure  18.   A  rectangular  slit  image  of  the  source  is  focused 
with  the  long  axis  along  the  radius  of  a  rotating  sectored  disc 
at  D.   Six  annular  rings  on  the  disc  modulate  the  transmitted
light  at  different  frequencies  lying  in  the  audible  range.   The 
lens  Lp  focuses  this  light  to  form  on  the  printed  page  an  image 
of  the  slit  at  D.   Reflected  light  from  the  page  falls  onto  the
photocell  P1,  while  a  portion  of  the  transmitted  light  from  D  falls
onto  a  second  photocell  P2.   Both  signals  are  fed  to  a  difference
unit  which  is  adjusted  so  that  when  white  paper  is  placed  in  the 
reading  position  the  signals  balance  and  no  output  is  heard. 
When  a  black  region  of  a  character  falls  under  a  portion  of  the 
illuminated  slit  the  signals  arriving  at  the  difference  unit  be- 
come unbalanced  and  a  tone,  corresponding  to  the  position  of  the 
black  segment,  is  heard  in  the  earphone. 

A  modern  development  of  the  optophone  dispenses  with  the 
scanning  disc  and  in  its  place  uses  a  column  of  six  or  more  photo- 
cells.  Each  cell  is  connected  via  an  electronic  switch  to  an  os- 
cillator tuned  to  one  of  a  range  of  frequencies  spaced  at  equal 
logarithmic  intervals  through  the  audio-spectrum.   Following  the 
convention  of  the  Fournier  d'Albe  optophone  the  highest  frequency 
is  switched  on  by  the  top  cell  of  the  column  and  the  lowest  frequency
by  the  bottom  cell.   In  the  case  of  the  Battelle  reader  output,
which  was  simulated  in  these  experiments,  an  oscillator  is  switched 
on  when  one-third  or  more  of  a  cell  is  covered  by  a  character  seg- 
ment.  The  discrete  switching  action  gives  the  output  a  staccato 
quality  which  contrasts  sharply  with  the  more  smoothly  varying 
display  of  the  original  optophone. 
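
The  switching  rule  described  above  can  be  illustrated  by  a  short  sketch.   It  is  not  the  Battelle  circuit  itself:  the  frequency  range  of  400  to  3200  cps,  the  function  names,  and  the  coverage  values  are  assumptions  chosen  only  for  illustration.   The  sketch  assigns  equal  logarithmic  frequency  steps  to  a  column  of  six  cells,  gives  the  highest  frequency  to  the  top  cell,  and  switches  a  tone  on  when  one-third  or  more  of  a  cell  is  covered.

```python
def log_spaced_frequencies(f_low, f_high, n_cells):
    """Return n_cells frequencies spaced at equal logarithmic intervals."""
    if n_cells < 2:
        raise ValueError("need at least two cells")
    step = (f_high / f_low) ** (1.0 / (n_cells - 1))  # constant frequency ratio
    return [f_low * step ** i for i in range(n_cells)]


def active_tones(coverage, frequencies, threshold=1.0 / 3.0):
    """Return the tones switched on for one position of the photocell column.

    coverage[0] is the top cell, which is paired with the highest frequency.
    A tone is on when the covered fraction of its cell reaches the threshold.
    """
    top_down = sorted(frequencies, reverse=True)
    return [f for c, f in zip(coverage, top_down) if c >= threshold]


# Hypothetical 400-3200 cps range for a six-cell column (not from the report).
FREQUENCIES = log_spaced_frequencies(400.0, 3200.0, 6)

# Assumed fraction of each cell, top to bottom, covered by a character segment.
TONES = active_tones([0.0, 0.5, 0.9, 0.2, 0.4, 0.0], FREQUENCIES)
print([round(f) for f in TONES])
```

Because  each  tone  is  either  fully  on  or  fully  off,  a  small  change  in  coverage  near  the  threshold  changes  the  output  abruptly,  which  is  consistent  with  the  staccato  quality  noted  above.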

APPENDIX  B:

The  scanning  system  which  transforms  print  into  digital  form  is
shown  in  Figure  19.   A  negative  copy  of  the  printed  text  is  sited 
at  the  focus  T'l   between  the  beam  splitter  B    and  the  lens  L  ^    and 
is  transported  horizontally.   The  lamp  filament  is  focused  onto
the  circular  aperture  S,    through  a  vertical  slit  T.      S   is  focused 
onto  a  chopper  at  S^    having  36  blades  and  the  slit  T   onto  the 
scanning  disc  at  T^    containing  16  radial  slits.   The  scanning 
disc  in  its  turn  is  focused  onto  the  film  strip  by  L^    and  the 
transmitted  light  collected  by  the  lens  Lr   to  fall  on  the  photo 
transistor  P^.   This  cell  therefore  provides  a  signal  which  de- 
pends upon  the  presence  or  absence  of  a  character  on  the  film 
strip.   The  chopper,  scanning  disc,  and  film  transport  are  all 
synchronized  by  a  system  of  gears  driven  by  a  single  electric  motor.
The  two  other  photocells,  Tape  Step  and  End  of  Scan,  are
used  to  signal  the  sampling  rate  and  the  end  of  a  scan.   The  hori- 
zontal slit  R   is  placed  so  that  only  the  last  sample  of  each  scan 
falls  on  the  photocell  Pg.   Finally  the  signals  from  these  three 
cells  are  fed,  via  trigger  circuits,  into  a  paper  tape  punch  and
the  output  subsequently  processed  by  the  feature  analysis  program. 
On  average  the  width  of  a  character  is  between  five  and  six  scans 
and  each  scan  contains  twelve  1-bit  samples  giving  a  matrix  of 
between  60  and  70  bits  per  character. 
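
The  size  of  the  character  matrix  follows  directly  from  the  figures  just  given.   The  short  sketch  below  simply  multiplies  the  number  of  scans  by  the  twelve  samples  per  scan;  the  function  and  constant  names  are  illustrative  only.

```python
SAMPLES_PER_SCAN = 12  # twelve 1-bit samples per scan, as stated in the text


def bits_per_character(scans_per_character):
    """Return the number of 1-bit samples in the matrix for one character."""
    return scans_per_character * SAMPLES_PER_SCAN


for scans in (5, 6):
    print(scans, "scans ->", bits_per_character(scans), "bits")
# 5 scans -> 60 bits
# 6 scans -> 72 bits
```

A  character  exactly  six  scans  wide  therefore  occupies  72  bits,  so  the  quoted  range  of  60  to  70  bits  is  best  read  as  a  rounded  figure  for  the  average  character.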


APPENDIX  C: 

With  the  exception  of  the  Parametric  Artificial  Talking  Apparatus 
which  was  designed  by  Lawrence  (11)  all  the  circuitry  used  in  the 
stimulus  generator  was  built  in  the  laboratory. 

The  digital-analogue  (D-A)  converter  and  demultiplexing  unit
is  shown  in  Figures  20,  21,  and  22.   Figure  20  shows  a  schematic
diagram  of  the  connections  between  the  D-A  converter,  the  six
clamp  amplifiers,  and  the  eight  channel  selector  box.   The  output
from  the  D-A  converter  is  fed  to  all  the  clamp  amplifier  inputs.   Signals
from  the  data  selector  (set  for  six  channels)  set  each  clamp
amplifier  in  turn  to  the  value  appearing  at  its  input.   The  count- 
er of  the  data  selector  is  maintained  in  phase  by  means  of  a 
resetting  pulse  derived  from  the  eighth  hole  (H)  of  the  paper 
tape.   Figure  21  shows  the  D-A  converter  and  a  clamp  amplifier. 
Figure  22  shows  a  semi-schematic  diagram  of  the  eight  channel  data 
selector  unit. 
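
The  action  of  the  demultiplexing  unit  can  be  sketched  in  software  as  a  set  of  hold  circuits  loaded  in  rotation.   This  is  a  simplified  model  and  not  a  description  of  the  circuits  in  Figures  20  to  22.   In  particular  it  assumes  that  each  row  of  paper  tape  carries  one  converted  value,  and  that  a  punched  eighth  hole  resets  the  channel  counter  immediately  before  the  value  on  that  row  is  latched;  the  report  does  not  state  this  timing.

```python
def demultiplex(tape_rows, n_channels=6):
    """Distribute serial D-A values to n_channels hold (clamp) outputs.

    tape_rows is a sequence of (value, sync) pairs, one per row of tape.
    sync is True when the eighth hole is punched; it resets the channel
    counter before the value on that row is latched (an assumed timing).
    Returns a snapshot of all held values each time the last channel is set.
    """
    held = [0.0] * n_channels
    channel = 0
    snapshots = []
    for value, sync in tape_rows:
        if sync:
            channel = 0          # resetting pulse keeps the counter in phase
        held[channel] = value    # selected clamp amplifier takes the input value
        channel = (channel + 1) % n_channels
        if channel == 0:         # all channels have been refreshed
            snapshots.append(list(held))
    return snapshots


# One complete frame of six values, with the sync hole on the first row.
FRAME = [(float(i), i == 1) for i in range(1, 7)]
print(demultiplex(FRAME))
```

In  this  model  a  row  lost  from  the  tape  upsets  only  the  frame  in  which  it  occurs,  since  the  next  resetting  pulse  returns  the  counter  to  the  first  channel.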

Figures  23  and  24  show  the  circuits  of  the  counter  and  tape
sequencing  unit.   This  device  is  used  in  conjunction  with  two 
tape  recorders.   The  first  machine  carries  an  endless  loop  of 
tape  on  which  the  series  of  stimulus  sounds  has  been  recorded  on 
track  one  and  a  series  of  short  pulses  at  1000  cps,  preceding  the
sounds,  recorded  on  track  two.   The  signal  pulse  from  track  two
triggers  the  circuit  shown  in  Figure  23  and  the  output  is  fed  in- 
to the  counting  circuit  shown  in  Figure  24.   The  sequencing  unit 
compares  the  number  registered  on  the  counter  with  one  of  a  se- 
quence of  numbers  punched  on  paper  tape  and  when  the  numbers  co- 
incide the  device  switches  on  the  second  recorder  which  then  re- 
cords the  stimulus  sound.   The  device  therefore  assembles  se- 
quences of  sounds  drawn  from  the  loop  of  tape  in  any  order  speci- 
fied on  the  input  paper  tape. 
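
The  logic  of  the  sequencing  unit  may  be  easier  to  follow  from  a  small  simulation.   The  sketch  below  is  an  illustration  under  stated  assumptions  and  not  a  description  of  the  circuits  in  Figures  23  and  24:  it  assumes  that  the  counter  restarts  at  one  on  each  pass  of  the  loop,  and  that  the  next  number  is  read  from  the  paper  tape  as  soon  as  a  sound  has  been  copied.   The  names  used  are  hypothetical.

```python
def assemble_sequence(loop_sounds, order):
    """Simulate the sequencing unit copying sounds from the tape loop.

    loop_sounds: the sounds in the order recorded on the endless loop.
    order: 1-based positions on the loop, as read from the paper tape.
    Returns the assembled recording and the number of loop passes used.
    """
    for wanted in order:
        if not 1 <= wanted <= len(loop_sounds):
            raise ValueError(f"no sound number {wanted} on the loop")
    pending = list(order)
    recorded = []
    passes = 0
    while pending:
        passes += 1
        # Each track-two pulse advances the counter just before its sound.
        for count, sound in enumerate(loop_sounds, start=1):
            if pending and count == pending[0]:
                recorded.append(sound)  # second recorder switched on
                pending.pop(0)          # next number read from paper tape
    return recorded, passes


LOOP = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
SEQUENCE, PASSES = assemble_sequence(LOOP, [3, 1, 2, 2])
print(SEQUENCE, PASSES)
# ['s3', 's1', 's2', 's2'] 3
```

Under  these  assumptions  a  request  for  a  sound  that  has  already  passed  the  playback  head  must  wait  for  the  next  pass  of  the  loop,  so  the  time  taken  to  assemble  a  sequence  depends  on  the  order  requested.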

Figures  25,  26,  and  27  are  circuit  diagrams  of  some  other  de- 
vices mentioned  in  this  report.   Mr.  D.L.A.  Barber,  Mr.  J.R.  Parks, 
and  Mr.  E.P.H.  Woodroff  have  been  responsible  for  the  design  of 
much  of  the  equipment  described  here. 

APPENDIX  D:   CONFUSION  MATRICES 

Conventional  Optophone

Stimuli  Presented

           1    284     10      3     11      4      2      2      3
           2     19    259      2     35      0      1      1      4
           3      1      1    275      1     12     15      2
Responses  4     29     37      2    185      4      4      0      2
           5      6      2      4      6    168     35      4     31
           6      4      2      9      1     24
APPENDIX  D:   (continued)

Compressed  Optophone
Stimuli  Presented

                  A      B      C      D      E      F      G      H

           A    388     12      2      2      0      0      0      4
           B     16    276      0      1      3      3      3      2
           C      0      0    315      0      1      5      0      1
Responses  D      1      1      0    245      0      0      0      0
           E      0      0      1      0    223     17      3      6
           F      0      0      8      0     21    293      3      5
           G      1      9      2      1      6     10    261      4
           H      3      1     11      0     10      7      3    237

Group  1  (Similar  Sounds)

Stimuli  Presented

                  1      2      3      4      5      6      7      8

           1    401     14      5      2     11      0      0      0
           2      5    360      4      4      1      0      0      0
           3      9      6    379     13      4      0      0      0
Responses  4      7      1      6    276      3      0      0      0
           5      1      1      0      4    307      4      2      0
           6      0      0      0      0      0    382      1     13
           7      2      0      0      0      7      5    317      2
           8      0      0      0      0      1     11      1    282
APPENDIX  D:   (continued)

Group  2  (Dissimilar  Sounds)

Stimuli  Presented

                  A      B      C      D      E      F      G      H

           A    327      0      1      0     60      0      0      0
           B      0    344      0     11      0      1      0      5
           C      1      0    401      0      5      0      3      0
Responses  D      1     24      0    280      0      2      0      2
           E     94      0      5      2    246      0      1      0
           F      0      0      0      0      0    398      0      1
           G      9      1      0      2     10      2    330      0
           H      0      1      0      1      0      3      0    287

Group  3  (Optophone  Sounds)

Stimuli  Presented

                  S      T      U      V      W      X      Y      Z

           S    396      2      2      0      3      1      1      5
           T      3    374      1      5      2      2      4      1
           U     12      0    316      3      0      3      3     28
Responses  V      1      1      1    264     26     46     36      3
           W      0      2      3     21    204    101     65     11
           X      3      1      0     15     27    174     66      2
           Y      1      2      1     15     51     75     99      3
           Z     11      0     70      2     13      1      2    242
APPENDIX  D:   (continued)

PAT  Output

Stimuli  Presented

                  A      B      C      D      E      F      G      H

           A    392      6     16      4     15      1      0      0
           B      4    313      5     10     11      5      4      2
           C     14      1    346      3     16      1      1      0
Responses  D      1      2      2    216     17      1      1      3
           E      8     24     31     48    244      3      8      0
           F      2      3      2      5      6    371     10      7
           G      2      3      0      3      0      8    284      3
           H      0      1      1      0      0     13     18    294

MDO  Output

Stimuli  Presented

                  I      J      K      L      M      N      O      P

           I    315     57      1      9      4      1      0      0
           J     44    286      0      5      1     15      3      0
           K      1      0    361      7      3      0      0      0
Responses  L     26      2     18    260     27      1      0      0
           M     12      3      4      5    295      1      0      0
           N      3     32      0      2      1    358     27      0
           O      4      0      0      0      0     27    270      4
           P     11      1      0      0      0      0     22    292
APPENDIX  D:   (continued)

WO  Output

Stimuli  Presented

                  S      T      U      V      W      X      Y      Z

           S    231     21     14      7     31     29     18      8
           T     63    340      2      3      2     15      1      1
           U     26      0    341     10     13      9      4      5
Responses  V     10      1      7    278      8      4      9      8
           W     35      0     25      4    248     20     10      8
           X     25     15      2      8      6    268     37     34
           Y     13      2      1      9      7     34    118     80
           Z      4      3      0      4      1     15     75    145
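
One  simple  way  of  summarizing  a  confusion  matrix  is  the  proportion  of  responses  lying  on  the  diagonal.   The  sketch  below  applies  this  to  the  MDO  and  WO  matrices  given  above.   It  is  offered  only  as  an  illustration  and  is  not  necessarily  the  analysis  used  in  the  body  of  the  report;  the  function  name  is  hypothetical,  and  the  seventh  row  of  the  MDO  matrix  is  assumed  to  be  the  response  "O".

```python
# Rows are responses, columns are stimuli presented (Appendix D).
MDO = [
    [315, 57, 1, 9, 4, 1, 0, 0],
    [44, 286, 0, 5, 1, 15, 3, 0],
    [1, 0, 361, 7, 3, 0, 0, 0],
    [26, 2, 18, 260, 27, 1, 0, 0],
    [12, 3, 4, 5, 295, 1, 0, 0],
    [3, 32, 0, 2, 1, 358, 27, 0],
    [4, 0, 0, 0, 0, 27, 270, 4],   # assumed to be response "O"
    [11, 1, 0, 0, 0, 0, 22, 292],
]
WO = [
    [231, 21, 14, 7, 31, 29, 18, 8],
    [63, 340, 2, 3, 2, 15, 1, 1],
    [26, 0, 341, 10, 13, 9, 4, 5],
    [10, 1, 7, 278, 8, 4, 9, 8],
    [35, 0, 25, 4, 248, 20, 10, 8],
    [25, 15, 2, 8, 6, 268, 37, 34],
    [13, 2, 1, 9, 7, 34, 118, 80],
    [4, 3, 0, 4, 1, 15, 75, 145],
]


def proportion_correct(matrix):
    """Return the fraction of all responses that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


for name, matrix in (("MDO", MDO), ("WO", WO)):
    print(f"{name}: {proportion_correct(matrix):.3f}")
```

On  these  figures  roughly  86  percent  of  the  MDO  responses  and  71  percent  of  the  WO  responses  are  correct.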

SUMMARY 

It  is  intended  that  this  report  should  draw  together  the  most  important
results  of  the  work  carried  out  during  the  three-year
period  from  November  1961.   The  particular  topic  of  this  study 
forms  only  one  of  a  number  of  problems  requiring  attention  before 
a  practical  reading  device  can  be  built.   At  the  time  when  this 
research  began  the  general  problem  of  coding  auditory  information 
to  achieve  higher  reading  speeds  appeared  to  be  the  most  important. 
Some  definite  progress  has  been  made  in  this  field  and  the  pro- 
spects of  further  improvements  are  good,  but  the  investigators 
have,  during  the  course  of  the  research,  become  aware  of  a  num- 
ber of  factors  which  might  make  future  progress  more  difficult. 
The  report  discusses  a  number  of  these  questions.   In  this  con- 
text the  writer  particularly  wishes  to  acknowledge  many  helpful 
discussions  with  colleagues  and  members  of  the  St.  Dunstan's 
Scientific  Committee. 